Writing OpenMP Programs on
Many and Multi Core Machines
Prof NB Venkateswarlu
ISTE Visiting Professor 2010-11
CSE, AITAM, Tekkali
venkat_ritch@yahoo.com
www.ritchcenter.com/nbv
Agenda
• Why OpenMP ?
• Elements of OpenMP
• Scalable Speedup and Data Locality
• Parallelizing Sequential Programs
• Breaking data dependencies
• Avoiding synchronization overheads
• Achieving Cache and Page Locality
• SGI Tools for Performance Analysis
and Tuning
Why OpenMP ?
• Parallel programming is more difficult
than sequential programming
• OpenMP is a scalable, portable,
incremental approach to designing
shared memory parallel programs
• OpenMP supports
– fine and coarse grained parallelism
– data and control parallelism
What is OpenMP ?
Three components:
• Set of compiler directives for
– creating teams of threads
– sharing the work among threads
– synchronizing the threads
• Library routines for setting and querying
thread attributes
• Environment variables for controlling run-
time behavior of the parallel program
Elements of OpenMP
• Parallel regions and work sharing
• Data scoping
• Synchronization
• Compiling and running OpenMP
programs
Parallelism in OpenMP
• The parallel region is the construct for
creating multiple threads in an
OpenMP program
• A team of threads is created at run
time for a parallel region
• A nested parallel region is allowed,
but may contain a team of one thread
• Nested parallelism is enabled with
setenv OMP_NESTED TRUE
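As a small illustration (not part of the original slides), the following C sketch enables nesting from inside the program rather than via OMP_NESTED; the num_threads clauses and team sizes are arbitrary choices:
#include <stdio.h>
#include <omp.h>
int main() {
    omp_set_nested(1);                 /* same effect as setenv OMP_NESTED TRUE */
    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(2)
        {
            /* with nesting disabled, this inner team would have one thread */
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}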
Parallelism in OpenMP
(Fork-join diagram: the master thread forks a team of threads at the beginning of each parallel region and joins it at the end of the parallel region.)
Hello World in OpenMP
#include <stdio.h>
#include <omp.h>
int main() {
  int iam = 0, np = 1;
#pragma omp parallel private(iam, np)
  {
#if defined (_OPENMP)
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
#endif
    printf("Hello from thread %d out of %d\n", iam, np);
  }
  return 0;
}
parallel region directive
with data scoping clause
Specifying Parallel Regions
• Fortran
!$OMP PARALLEL [clause [clause…]]
  ! Block of code executed by all threads
!$OMP END PARALLEL
• C and C++
#pragma omp parallel [clause [clause...]]
{
  /* Block executed by all threads */
}
Work sharing in OpenMP
• Two ways to specify parallel work:
– Explicitly coded in parallel regions
– Work-sharing constructs
» DO and for constructs: parallel loops
» sections
» single
• SPMD type of parallelism supported
Work and Data Partitioning
Loop parallelization
• distribute the work among the threads,
without explicitly distributing the data.
• scheduling determines which thread
accesses which data
• communication between threads is
implicit, through data sharing
• synchronization via parallel constructs
or is explicitly inserted in the code
Data Partitioning & SPMD
• Data is distributed explicitly among
processes
• With message passing, e.g., MPI,
where no data is shared, data is
explicitly communicated
• Synchronization is explicit or
embedded in communication
• With parallel regions in OpenMP, both
SPMD and data sharing are supported
Pros and Cons of SPMD
» Potentially higher parallel fraction
than with loop parallelism
» The fewer parallel regions, the less
overhead
» More explicit synchronization needed
than for loop parallelization
» Does not promote incremental
parallelization and requires manually
assigning data subsets to threads
SPMD Example
program mat_init
implicit none
integer, parameter :: N=1024
real A(N,N)
integer :: iam, np
integer, external :: omp_get_num_threads, omp_get_thread_num
iam = 0
np = 1
!$omp parallel private(iam,np)
np = omp_get_num_threads()
iam = omp_get_thread_num()
! Each thread calls work
call work(N, A, iam, np)
!$omp end parallel
end
subroutine work(n, A, iam, np)
integer n, iam, np
real A(n,n)
integer :: chunk,low,high,i,j
chunk = (n + np - 1)/np
low = 1 + iam*chunk
high = min(n,(iam+1)*chunk)
do j = low, high
  do i = 1, n
    A(i,j) = 3.14 + &
      sqrt(real(i*i*i+j*j+i*j*j))
  enddo
enddo
return
end
A single parallel region, no scheduling needed,
each thread explicitly determines its work
Extent of directives
Most directives have as extent a
structured block, or basic block, i.e., a
sequence of statements with a flow of
control that satisfies:
• there is only one entry point in the
block, at the beginning of the block
• there is only one exit point, at the end
of the block; the exceptions are that
exit() in C and stop in Fortran are
allowed
Work Sharing Constructs
• DO and for : parallelizes a loop,
dividing the iterations among the
threads
• sections : defines a sequence of
contiguous blocks, the beginning of
each block being marked by a section
directive. The block within each
section is assigned to one thread
• single: assigns a block of a parallel
region to a single thread
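A short C sketch (illustrative, not from the slides) using all three work-sharing constructs inside one parallel region; do_x and do_y are placeholder routines:
#include <stdio.h>
#include <omp.h>
void do_x(void) { printf("section X\n"); }
void do_y(void) { printf("section Y\n"); }
int main() {
    int i, n = 8;
    double a[8], b[8];
    for (i = 0; i < n; i++) b[i] = i;
    #pragma omp parallel
    {
        #pragma omp for                 /* iterations divided among threads */
        for (i = 0; i < n; i++)
            a[i] = 2.0 * b[i];
        #pragma omp sections            /* each section goes to one thread  */
        {
            #pragma omp section
            do_x();
            #pragma omp section
            do_y();
        }
        #pragma omp single              /* executed by exactly one thread   */
        printf("a[0] = %f\n", a[0]);
    }
    return 0;
}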
Specialized Parallel
Regions
Work-sharing can be specified in
combination with a parallel region
• parallel DO and parallel for : a
parallel region which contains a
parallel loop
• parallel sections, a parallel region
that contains a number of section
constructs
Scheduling
• Scheduling assigns the iterations of a
parallel loop to the team threads
• The directives [parallel] do and
[parallel] for take the clause
schedule(type [,chunk])
• The optional chunk is a loop-invariant
positive integer specifying the number
of contiguous iterations assigned to a
thread
Scheduling
The type can be one of
• static threads are statically assigned
chunks of size chunk in a round-robin
fashion. The default for chunk is
ceiling(N/p) where N is the number of
iterations and p is the number of
processors
• dynamic threads are dynamically
assigned chunks of size chunk, i.e.,
Scheduling
when a thread is ready to receive new
work, it is assigned the next pending
chunk. Default value for chunk is 1.
• guided a variant of dynamic
scheduling in which the size of the
chunk decreases exponentially from
chunk to 1. Default value for chunk is
ceiling(N/p)
Scheduling
• runtime indicates that the schedule
type and chunk are specified by the
environment variable OMP_SCHEDULE. A
chunk cannot be specified with
runtime.
• Example of run-time specified
scheduling
setenv OMP_SCHEDULE “dynamic,2”
Scheduling
• If the schedule clause is missing, an
implementation dependent schedule is
selected. MIPSpro selects by default
the static schedule
• Static scheduling has low overhead
and provides better data locality
• Dynamic and guided scheduling may
provide better load balancing
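For illustration (not from the original slides), the schedule types above might be written as follows; work() is a stand-in for an arbitrary per-iteration computation and the chunk sizes are arbitrary:
#include <omp.h>
#define N 1000
static double work(int i) { return 0.5 * i; }   /* stand-in computation */
void run_schedules(double *a) {
    int i;
    /* static: contiguous chunks of 4 iterations assigned round-robin */
    #pragma omp parallel for schedule(static, 4)
    for (i = 0; i < N; i++) a[i] = work(i);
    /* dynamic: chunks of 4 handed out as threads become free */
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < N; i++) a[i] = work(i);
    /* guided: chunk size decreases from a large value toward 1 */
    #pragma omp parallel for schedule(guided)
    for (i = 0; i < N; i++) a[i] = work(i);
    /* runtime: type and chunk read from OMP_SCHEDULE, e.g. "dynamic,2" */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++) a[i] = work(i);
}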
Work Sharing Constructs
A motivating example
• Sequential code
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }
• OpenMP parallel region (work divided by hand)
#pragma omp parallel
{
  int id, i, Nthrds, istart, iend;
  id = omp_get_thread_num();
  Nthrds = omp_get_num_threads();
  istart = id * N / Nthrds;
  iend = (id+1) * N / Nthrds;
  for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
}
• OpenMP parallel region and a work-sharing for construct
#pragma omp parallel
#pragma omp for schedule(static)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }
Work-sharing Construct
 Threads are assigned an independent set of iterations
 Threads must wait at the end of the work-sharing construct (implicit barrier)
#pragma omp parallel
#pragma omp for
for(i = 1; i < 13; i++)
  c[i] = a[i] + b[i];
(Diagram: iterations i = 1 .. 12 are divided among the threads of the team, which wait at the implicit barrier at the end of the for construct.)
Combining pragmas
 These two code segments are equivalent
#pragma omp parallel
{
  #pragma omp for
  for (i=0; i < MAX; i++) {
    res[i] = huge();
  }
}
#pragma omp parallel for
for (i=0; i < MAX; i++) {
  res[i] = huge();
}
Types of Extents
Two types for the extent of a directive:
• static or lexical extent: the code
textually enclosed between the
beginning and the end of the
structured block following the
directive
• dynamic extent: static extent as well
as the procedures called from within
the static extent
Orphaned Directives
A directive which is in the dynamic
extent of another directive but not in
its static extent is said to be orphaned
• Work sharing directives can be
orphaned
• This allows a work-sharing construct
to occur in a subroutine which can be
called both by serial and parallel code,
improving modularity
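A sketch of an orphaned work-sharing directive (illustrative example; axpy is a made-up routine): the omp for in axpy binds to the parallel region of the caller when one exists, and simply runs serially otherwise:
#include <omp.h>
#define N 1024
/* orphaned work-sharing directive: the omp for is not lexically
   inside a parallel construct */
void axpy(double alpha, double *x, double *y) {
    int i;
    #pragma omp for
    for (i = 0; i < N; i++)
        y[i] += alpha * x[i];
}
void caller(double *x, double *y) {
    axpy(2.0, x, y);            /* called from serial code: runs serially  */
    #pragma omp parallel        /* called inside a parallel region:        */
    axpy(2.0, x, y);            /* iterations are shared among the team    */
}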
Directive Binding
• Work sharing directives (do, for,
sections, and single) as well as
master and barrier bind to the
dynamically closest parallel directive,
if one exists, and have no effect when
they are not in the dynamic extent of a
parallel region
• The ordered directive binds to the
enclosing do or for directive having
the ordered clause
Directive Binding
• critical (and atomic) provide
mutual exclusive execution (and
update) with respect to all the
threads in the program
Data Scoping
• Work-sharing and parallel
directives accept data scoping clauses
• Scope clauses apply to the static
extent of the directive and to variables
passed as actual arguments
• The shared clause applied to a
variable means that all threads will
access the single copy of that variable
created in the master thread
Data Scoping
• The private clause applied to a
variable means that a volatile copy of
the variable is cloned for each thread
• Semi-private data for parallel loops:
– reduction: variable that is the target of a reduction
operation performed by the loop, e.g., sum
– firstprivate: initialize the private copy from the
value of the shared variable
– lastprivate: upon loop exit, master thread holds
the value seen by the thread assigned the last loop
iteration
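An illustrative C loop (not taken from the slides) using the three clauses just listed:
#include <stdio.h>
#define N 100
int main() {
    int i, last = -1;
    double scale = 2.0, sum = 0.0, a[N];
    #pragma omp parallel for firstprivate(scale) lastprivate(last) reduction(+:sum)
    for (i = 0; i < N; i++) {
        a[i] = scale * i;    /* each thread starts with scale == 2.0        */
        sum += a[i];         /* per-thread partial sums combined at the end */
        last = i;            /* value from the final iteration is kept      */
    }
    printf("sum = %f, last iteration = %d\n", sum, last);
    return 0;
}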
Threadprivate Data
• The threadprivate directive is
associated with the declaration of a
static variable (C) or common block
(Fortran) and specifies persistent data
(spans parallel regions) cloned, but
not initialized, for each thread
• To guarantee persistence, the dynamic
threads feature must be disabled
setenv OMP_DYNAMIC FALSE
Threadprivate Data
• threadprivate data can be initialized
in a thread using the copyin clause
associated with the parallel,
parallel do/for, and parallel
sections directives
• the value stored in the master thread
is copied into each team thread
• Syntax: copyin (name [,name])
where name is a variable or (in Fortran)
a named common block
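A minimal C sketch (illustrative; counter is a made-up variable) of threadprivate combined with copyin:
#include <stdio.h>
#include <omp.h>
static int counter;                   /* file-scope static variable          */
#pragma omp threadprivate(counter)    /* each thread keeps a persistent copy */
int main() {
    omp_set_dynamic(0);               /* same effect as setenv OMP_DYNAMIC FALSE */
    counter = 10;                     /* value stored in the master thread       */
    #pragma omp parallel copyin(counter)   /* master's value copied to the team  */
    {
        counter += omp_get_thread_num();
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}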
Scoping Rules
• Data declared outside a parallel
region is shared by default, except for
– loop index variable of parallel do
– data declared as threadprivate
• Local data in the dynamic extent of a
parallel region is private:
– subroutine local variables, and
– C/C++ blocks within a parallel region
Scoping Restrictions
• The private clause for a directive in
the dynamic extent of a parallel region
can be specified only for variables that
are shared in the enclosing parallel
region
– That is, a privatized variable cannot
be privatized again
• The shared clause is not allowed for
the DO (Fortran) or for (C) directive
Shared Data
• Access to shared data must be
mutually exclusive: one thread at a time
• For shared arrays, when different
threads access mutually exclusive
subscripts, synchronization is not
needed
• For shared scalars, critical sections or
atomic updates must be used
• Consistency operation: flush directive
Synchronization
Explicit, via directives:
• critical, implements the critical
sections, providing mutual exclusion
• atomic, implements atomic update of
a shared variable
• barrier, a thread waits at the point
where the directive is placed until all
other threads reach the barrier
Synchronization
• ordered, preserves the order of the
sequential execution; can occur at
most once inside a parallel loop
• flush, creates consistent view of
thread-visible data
• master, block in a parallel region that
is executed by the master thread and
skipped by the other threads; unlike
single, there is no implied barrier
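The directives above could appear together as in this illustrative C fragment (nthreads and sum_of_ids are arbitrary shared counters):
#include <stdio.h>
#include <omp.h>
int main() {
    int nthreads = 0, sum_of_ids = 0;
    #pragma omp parallel
    {
        #pragma omp atomic            /* atomic update of a shared scalar */
        nthreads += 1;
        #pragma omp critical          /* general mutual exclusion         */
        {
            sum_of_ids += omp_get_thread_num();
        }
        #pragma omp barrier           /* wait for the whole team          */
        #pragma omp master            /* master only; no implied barrier  */
        printf("%d threads, sum of ids = %d\n", nthreads, sum_of_ids);
    }
    return 0;
}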
Implicit Synchronization
• There is an implied barrier at the
end of a parallel region, and of a work-
sharing construct for which a nowait
clause is not specified
• A flush is implied by an explicit or
implicit barrier as well as upon
entry and exit of a critical or
ordered block
Directive Nesting
• A parallel directive can appear in
the dynamic extent of another
parallel, i.e., parallel regions can be
nested
• Work-sharing directives binding to the
same parallel directive cannot be
nested
• An ordered directive cannot appear
in the dynamic extent of a critical
directive
Directive Nesting
• A barrier or master directive
cannot appear in the dynamic extent of
a work-sharing region ( DO or for,
sections, and single) or ordered
block
• In addition, a barrier directive cannot
appear in the dynamic extent of a
critical or master block
Environment Variables
Name Value
OMP_NUM_THREADS positive number
OMP_DYNAMIC TRUE or FALSE
OMP_NESTED TRUE or FALSE
OMP_SCHEDULE “static,2”
• Online help: man openmp
Library Routines
OpenMP defines library routines that
can be divided into three categories
1. Query and set multithreading
• get/set number of threads or processors
omp_set_num_threads,
omp_get_num_threads,
omp_in_parallel, …
• get thread ID:
omp_get_thread_num
Library Routines
2. Set and get execution environment
• Inquire/set nested parallelism:
omp_get_nested
omp_set_nested
• Inquire/set dynamic number of threads in
different parallel regions:
omp_set_dynamic
omp_get_dynamic
Library Routines
3. API for manipulating locks
• A lock variable provides thread
synchronization, has C type omp_lock_t
and Fortran type integer*8, and holds
a 64-bit address
• Locking routines: omp_init_lock,
omp_set_lock,omp_unset_lock...
Man pages: omp_threads, omp_lock
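A minimal sketch of the lock API (illustrative example; the accumulated quantity is arbitrary):
#include <stdio.h>
#include <omp.h>
int main() {
    omp_lock_t lock;
    int total = 0;
    omp_init_lock(&lock);
    #pragma omp parallel
    {
        omp_set_lock(&lock);          /* only one thread at a time past here */
        total += omp_get_thread_num();
        omp_unset_lock(&lock);
    }
    omp_destroy_lock(&lock);
    printf("total = %d\n", total);
    return 0;
}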
Reality Check
Irregular and ambiguous aspects are
sources of language- and
implementation dependent behavior:
• nowait clause is allowed at the
beginning of [parallel] for (C/C++)
but at the end of [parallel] DO
(Fortran)
• default clause can specify private
scope in Fortran, but not in C/C++
Reality Check
• Can only privatize full objects, not
array elements, or fields of data
structures
• For a threadprivate variable or
block one cannot specify any clause
except for the copyin clause
• In MIPSpro 7.3.1 one cannot specify in
the same directive both the
firstprivate and lastprivate
clauses for a variable
Reality Check
• With MIPSpro 7.3.1, when a loop is
parallelized with the do (Fortran) or for
(C/C++) directive, the indexes of the nested
loops are, by default, private in Fortran, but
shared in C/C++
Probably, this is a compiler issue
• Fortunately, the compiler warns about
unsynchronized accesses to shared
variables
• This does not occur for parallel do or
parallel for
Compiling and Running
• Use MIPSpro with the option -mp both for
compiling and linking
default -MP:open_mp=ON must be in effect
• Fortran:
f90 [-freeform] [-cpp]-mp prog.f
-freeform needed for free form source
-cpp needed when using #ifdefs
• C/C++:
cc -mp -O3 prog.c
CC -mp -O3 prog.C
Setting the Number of Threads
• Environment variables:
setenv OMP_NUM_THREADS 8
if OMP_NUM_THREADS is not set, but
MP_SET_NUMTHREADS is set, the latter
defines the number of threads
• Environment variables can be
overridden by the programmer:
omp_set_num_threads(int n)
Scalable Speedup
• Most often the memory is the limit to
the performance of a shared memory
program
• On scalable architectures, the latency
and bandwidth of memory accesses
depend on the locality of accesses
• In achieving good speedup of a shared
memory program, data locality is an
essential element
What Determines Data
Locality
• Initial data distribution determines on
which node the memory is placed
– first touch or round-robin system policies
– data distribution directives
– explicit page placement
• Work sharing, e.g., loop scheduling,
determines which thread accesses
which data
• Cache friendliness determines how
often main memory is accessed
Cache Friendliness
For both serial loops and parallel loops
• locality of references
– spatial locality: use adjacent cache lines and all
items in a cache line
– temporal locality: reuse same cache line; may
employ techniques such as cache blocking
• low cache contention
– avoid the sharing of cache lines among different
objects; may resort to array padding or increasing
the rank of an array
Cache Friendliness
• Contention is an issue specific to
parallel loops, e.g., false sharing of
cache lines
cache friendliness =
high locality of references
+
low contention
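An illustrative sketch of the contention problem and a padding fix (the 128-byte line size is an assumption, matching the Origin2000 secondary cache; the counters are made up):
#include <omp.h>
#define MAX_THREADS 64
#define LINE 128     /* assumed cache line size in bytes */
int counts[MAX_THREADS];                 /* adjacent counters share a cache line:
                                            false sharing when each thread
                                            repeatedly updates its own element  */
struct padded { int count; char pad[LINE - sizeof(int)]; };
struct padded counts_padded[MAX_THREADS];   /* one counter per cache line       */
void tally(int n) {
    #pragma omp parallel
    {
        int i, id = omp_get_thread_num();
        for (i = 0; i < n; i++)
            counts_padded[id].count += 1;   /* no cache-line ping-pong */
    }
}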
NUMA machines
• Memory hierarchies exist in single-CPU
computers and Symmetric
Multiprocessors (SMPs)
• Distributed shared memory (DSM)
machines based on Non-Uniform
Memory Architecture (NUMA) add
levels to the hierarchy:
– local memory has low latency
– remote memory has high latency
Origin2000 memory
hierarchy
Level Latency (cycles)
register 0
primary cache 2..3
secondary cache 8..10
local main memory & TLB hit 75
remote main memory & TLB hit 250
main memory & TLB miss 2000
page fault 10^6
Page Level Locality
• An ideal application has full page
locality: pages accessed by a
processor are on the same node as the
processor, and no page is accessed by
more than one processor (no page
sharing)
• Twofold benefit:
» low memory latency
» scalability of memory bandwidth
Page Level Locality
• The benefits brought about by page
locality are more important for
programs that are not cache friendly
• We look at several data placement
strategies for improving page locality
» system based placement
» data initialization and directives
» combination of system and program
directed data placement
Page sharing due to alignment
(Diagram: the array layout spans pages 1, 2, and 3; the array section accessed by processor 1 covers page 1 and part of page 2, and the section accessed by processor 2 covers the rest of page 2 and page 3.)
• Consider an array whose size is twice the size
of a page, and which is distributed between two
nodes
• Page 1 and page 2 are located on node 1,
page 3 is on node 2
• Page 2 is shared by the two processors, due to
the array not starting on a page boundary
Achieving Page Locality
IRIX has two page placement policies:
• first-touch: the process which first
references a virtual address causes that
address to be mapped to a page on the
node where the process runs
• round-robin: pages allocated to a job are
selected from nodes traversed in round-
robin order
• IRIX uses first-touch, unless
setenv _DSM_ROUND_ROBIN
Achieving Page Locality
IRIX allows pages to be migrated between
nodes, to adjust the page placement
• a page is migrated based on the affinity of
data accesses to that page, which is
derived at run-time from the per-process
cache-miss pattern
• page migration follows the page affinity
with a delay whose magnitude depends
on the aggressiveness of migration
Achieving Page Locality
• To enable data migration, except for
explicitly placed data
setenv _DSM_MIGRATION ON
• To enable migration of all data
setenv _DSM_MIGRATION ALL_ON
• To set the aggressiveness of migration
setenv _DSM_MIGRATION_LEVEL n
where n is an integer between 0 (least
aggressive, disables migration) and 100 (most
aggressive, the default)
Achieving Page Locality
Methods, from best to worst
• Parallel data initialization, using
OpenMP parallel work constructs such
as parallel do, combined with
operating system’s first-touch
placement policy
» works with heap, local, global arrays
» no data distribution directives needed
» can be used with page migration
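A C sketch of this first method (illustrative): the initialization loop uses the same static schedule as the compute loop, so under first-touch each page is placed on the node of the thread that later uses it:
#include <omp.h>
#define N (1 << 20)
static double a[N], b[N];
int main() {
    int i;
    /* parallel initialization: under first-touch, each page is placed on
       the node of the thread that touches it first */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = (double)i;
    }
    /* same static schedule: each thread now computes on local pages */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = a[i] + b[i];
    return 0;
}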
Achieving Page Locality
• IRIX round-robin page placement
» improves application’s memory
bandwidth
» no change of code needed
» allows both serial and parallel
initialization of data in the program
Achieving Page Locality
• Regular distribution directive
» allows serial initialization of data
» data has same layout as in a serial
program
» page granularity of distribution
» cannot distribute heap allocated and
assumed-size arrays
Achieving Page Locality
• Page Migration:
» makes initial data placement less
important, e.g., allows sequential data
initialization
» improves locality of a computation
whose data access pattern changes
during the computations
» it is useful for programs that have
stable affinity for long time intervals
Achieving Page Locality
» page migration can be combined
with other techniques such as first-
touch or round-robin
» page migration is expensive
» page migration implements CPU
affinity with a delay
Reshaped Distribution
• Reshaped distribution directive
» no page granularity limitation
» data layout is most likely different
from the layout in a serial program
» code bloating: each routine that is
passed a reshaped parameter must
have a version specialized for handling
reshaped arrays
– layout is different for reshaped arrays
» cannot reshape initialized data, heap
allocated and assumed-size arrays
» overhead of indirect addressing
» side effect: a global structure or
Fortran common block that contains a
reshaped array cannot be declared
threadprivate, and cannot be
localized with the -Wl,-Xlocal option
Reshaped Distribution
• Regular distribution
!$SGI distribute a(d1[,d2])
[onto(p1[,p2])]
#pragma distribute a[d1][[d2]] [onto(p1[,p2])]
• Reshaped distribution
!$SGI distribute_reshape a(d1[,d2]) [onto(p1[,p2])]
#pragma distribute_reshape a[d1][[d2]]
• Distribution methods are denoted by d1, d2
• Optional clause onto specifies a processor
grid n1 x n2 , such that n1/n2 = p1/p2
SGI Data Placement Directives
Three distribution methods
* means no distribution along the
direction in which it appears
block distributes the elements of an
array in p contiguous chunks of size
ceiling(N/p), where N is the extent in
the distributed direction and p is the
number of processors
Distribution Specification
cyclic(k) distributes the elements of
an array in chunks of size k in a round-
robin fashion, i.e.,
– the first processor is assigned array elements 1, k+1,
2k+1, ... (in Fortran) or 0, k, 2k, ... (in C)
– the second processor is assigned elements 2, k+2,
... (in Fortran) or 1, k+1, 2k+1, ... (in C/C++)
Interleaved distribution is obtained for
k=1 (the default) and block-cyclic
distribution for k>1
Regular Distribution Tip
For regular distribution, one should
distribute the outer dimension of an
array, to minimize the effect of page
granularity:
• Distribute columns in Fortran
!$SGI distribute A(*, block)
• Distribute rows in C/C++
#pragma distribute a(block,*)
Distributed Arrays as Formal Parameters
• Assumed size is not allowed for array
formal parameters which are declared
as distributed
• Specify the array size:
void foo(int n, double a[n])
{
#pragma distribute_reshape a(block)
…
}
Reshaped Array Pitfall
If a reshaped array is declared as
threadprivate, the compiler will
silently ignore the threadprivate
directive:
double a[n];
#pragma omp threadprivate(a)
#pragma distribute_reshape a(block)
Parallelizing Code
• Optimize single-CPU performance
– maximize cache reuse
– eliminate cache misses
– compiler flags: -LNO:cache_size2=4m
-OPT:IEEE_arithmetic=3 -Ofast=ip27
• Parallelize as high a fraction of the
work as possible
– preserve cache friendliness
Parallelizing Code
– avoid synchronization and scheduling
overhead: partition in few parallel regions,
avoid reduction, single and critical
sections, make the code loop fusion
friendly, use static scheduling
– partition work to achieve load balancing
• Check correctness of parallel code
– run OpenMP compiled code first on one
thread, then on several threads
Synchronization Overhead
• Parallel regions, work-sharing, and
synchronization incur overhead
• Edinburgh OpenMP Microbenchmarks,
version 1.0, by J. Mark Bull, are used
to measure the cost of synchronization
on a 32 processor Origin 2000, with
300 MHz R12000 processors, and
compiling the benchmarks with
MIPSpro Fortran 90 compiler, version
7.3.1.1m
Synchronization Overhead
(The measured overhead charts are not reproduced here.)
Insights
• cost (DO) ~ cost(barrier)
• cost (parallel DO) ~ 2 * cost(barrier)
• cost (parallel) > cost (parallel DO)
• atomic is less expensive than critical
• bad scalability for
– reduction
– mutual exclusion: critical, (un)lock
– single
Loop Parallelization
• Identify the loops that are bottlenecks to
performance
• Parallelize the loops, and ensure that
– no data races are created
– cache friendliness is preserved
– page locality is achieved
– synchronization and scheduling
overheads are minimized
Hurdles to Loop
Parallelization
• Data dependencies among iterations
caused by shared variables
• Input/Output operations inside the loop
• Calls to thread-unsafe code, e.g., the
intrinsic function rtc
• Branches out of the loop
• Insufficient work in the loop body
• The MIPSpro auto-parallelizer helps in
identifying these hurdles
Auto-Parallelizer
• The MIPSpro auto-parallelizer (APO)
can be used both for automatically
parallelizing loops and for determining
the reasons which prevent a loop from
being parallelized
• The auto-parallelizer is activated using
the command line option -apo with the f90,
f77, cc, and CC compilers
• Other auto-parallelizer options: -apo
list and -mplist
Auto-Parallelizer
• Example:
f90 -apo list -mplist myprog.f
• -apo list enables APO and generates
the file myprog.list which describes
which loops have been parallelized,
which have not, and why not
• -mplist generates the parallelized
source program myprog.w2f.f
(myprog.w2c.c for C) equivalent to the
original code myprog.f
For More Information
About the Auto-Parallelizer
• The ProMP Parallel Analyzer View
product consists of the program cvpav
• Cvpav analyzes files created by
compiling with the option –apo keep
• Try out the tutorial examples:
cd /usr/demos/ProMP/omp_tutorial
make
cvpav -f omp_demo.f
Data Races
• Parallelizing a loop with data
dependencies causes data races:
unordered or interfering accesses by
multiple threads to shared variables,
which make the values of these
variables different from the values
assumed in a serial execution
• A program with data races produces
unpredictable results, which depend on
thread scheduling and speed.
Types of Data Dependencies
• Reduction operations:
const int n = 4096;
int a[n], i, sum=0;
for (i = 0; i < n; i++) {
sum += a[i];
}
– Easy to parallelize using reduction
variables
Types of Data Dependencies
– Auto-parallelizer is able to detect
reduction and parallelize it
const int n = 4096;
int a[n], i, sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; i++) {
sum += a[i];
}
Types of Data Dependencies
• Carried dependence on a shared
array, e.g., recurrence:
const int n = 4096;
int a[n], i;
for (i = 0; i < n-1; i++) {
a[i] = a[i+1];
}
– Non-trivial to eliminate, the auto-
parallelizer cannot do it
Parallelizing the Recurrence
#define N 16384
int a[N], work[N+1];
int i;
// Save border element
work[N] = a[0];
// Save & shift even indices
#pragma omp parallel for
for ( i = 2; i < N; i+=2)
{
work[i-1] = a[i];
}
// Update even indices from odd
#pragma omp parallel for
for ( i = 0; i < N-1; i+=2)
{
a[i] = a[i+1];
}
// Update odd indices with even
#pragma omp parallel for
for ( i = 1; i < N-1; i+=2)
{
a[i] = work[i];
}
// Set border element
a[N-1] = work[N];
Idea: Segregate even and odd indices
Performing Reduction
The bad scalability of the reduction clause
affects its usefulness, e.g., bad speedup
when summing the elements of a matrix:
#define N (1<<12)
#define M 16
int i, j;
double a[N][M], sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
sum += a[i][j];
Parallelizing the Sum
#define N (1<<12)
#define M 16
int main() {
  double a[N][M], sum = 0.0;
#pragma distribute a[block][*]
  int i, j;
#pragma omp parallel private(i,j)
  {
    double mysum = 0.0;
    // initialization of a not shown
    // compute partial sum
#pragma omp for nowait
    for (i = 0; i < N; i++)
      for (j = 0; j < M; j++)
        mysum += a[i][j];
    // each thread adds its partial sum
#pragma omp atomic
    sum += mysum;
  }
}
Idea: Use explicit partial sums and combine
them atomically
Sum and Product Speedup
(Speedup chart not reproduced here.)
Loop Fusion
• Increases the work in the loop body
• Better serial programs: fusion
promotes software pipelining and
reduces the frequency of branches
• Better OpenMP programs: fusion
reduces synchronization and
scheduling overhead
– fewer parallel regions and work-
sharing constructs
Promoting Loop Fusion
• Loop fusion inhibited by statements
between loops which may have
dependencies with data accessed by
the loops
• Promote fusion: reorder the code to get
loops which are not separated by
statements creating data dependencies
• Use one parallel do construct for
several adjacent loops; may leave it to
the compiler to actually perform fusion
Fusion-friendly code
Unfriendly (assignment between the loops inhibits fusion):
integer,parameter::n=4096
real :: sum, a(n)
do i=1,n
  a(i) = sqrt(dble(i*i+1))
enddo
sum = 0.d0
do i=1,n
  sum = sum + a(i)
enddo
Friendly (the two loops are adjacent and can be fused):
integer,parameter::n=4096
real :: sum, a(n)
sum = 0.d0
do i=1,n
  a(i) = sqrt(dble(i*i+1))
enddo
do i=1,n
  sum = sum + a(i)
enddo
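For comparison, an OpenMP version of the friendly form in C (illustrative sketch): a single parallel region encloses both work-sharing loops, so the team is created once and the second loop performs the reduction:
#include <math.h>
#define N 4096
double fused_sum(void) {
    static double a[N];
    double sum = 0.0;
    int i;
    #pragma omp parallel
    {
        #pragma omp for                   /* first loop shared by the team  */
        for (i = 0; i < N; i++)
            a[i] = sqrt((double)(i * i + 1));
        #pragma omp for reduction(+:sum)  /* second loop in the same region */
        for (i = 0; i < N; i++)
            sum += a[i];
    }
    return sum;
}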
Tradeoffs in Parallelization
• To increase parallel fraction of work
when parallelizing loops, it is best to
parallelize the outermost loop of a
nested loop
• However, doing so may require loop
transformations such as loop
interchanges, which can destroy cache
friendliness, e.g., defeat cache blocking
Tradeoffs in Parallelization
• Static loop scheduling in large chunks
per thread promotes cache and page
locality but may not achieve load
balancing
• Dynamic and interleaved scheduling
achieve good load balancing but cause
poor locality of data references
Tuning the Parallel Code
• Examine resource usage, e.g.,
execution time, number of floating
point operations, primary, secondary,
and TLB cache misses and identify
– the performance bottleneck
– the routines generating the
bottleneck
Useful SGI tools: perfex, ssrun, prof
• Correct the performance problem and
verify the desired speedup.
Investigating Data Placement
• In C, use the SGI function syssgi
#include <sys/syssgi.h>
ptrdiff_t syssgi (int request, …)
with a request value of SGI_PHYSP
• In Fortran, use the intrinsic function
integer dsm_home_threadnum
thread = dsm_home_threadnum(arr(i))
- see lab exercise
SGI Performance Tools
• MIPSpro compilers and libraries
• ProDev workshop (formerly CASE
Vision) : cvd, cvperf, cvstatic
• ProMP: parallel analyzer: cvpav
• SpeedShop: profiling execution and
reporting profile data
• Perfex: per process event count statistics
• dprof: address space profiling
Performance Data
• Timing: time spent in various sections of
the program
• Events captured by the performance
counters of the R1X000 CPU.
- 32 events, divided into two equal sets
- Examples: clock cycles, L1 and L2 cache,
and TLB misses, floating point operations,
number of instructions
• I/O system calls, heap malloc and
free, floating point exceptions
Speedshop Profiling
• Speedshop is a tool package which
supports profiling at the function and
source line level
• Uses several methods for collecting
information
– PC and call stack sampling
– basic block counting
– exception tracing
Perfex Tool
• Provides event statistics at the
process level
• Reports the number of occurrences of
the events captured by the R1X000
hardware counters in each process of
a parallel program
• In addition, reports information
derived from the event counts, e.g.
MFLOPS, memory bandwidth
Perfex Tool
perfex -mp [other options] a.out
• To profile secondary cache misses in the
data cache (event 26) and instruction
cache (event 10):
perfex -mp -e 26 -e 10 a.out
• To multiplex all 32 events (-a) , get time
estimates (-y) and trace exceptions (-x)
perfex -a -x -y a.out
Speedshop
• Data Collection
– ssrun main data collection tool.
Running it on a.out creates the files
a.out.experiment.mPID and a.out.experiment.pPID
– ssusage summary of resources
used, similar to the time commands
– ssapi API for caliper points
• Data Analysis
– prof
Ssrun sampling Experiments
Statistical sampling, triggered by a preset
time base or by overflow of hardware
counters
-pcsamp PC sampling gives user CPU
time
-usertime call stack sampling, gives
user and system CPU time
-totaltime call stack sampling, gives
walltime
Ssrun sampling
Sampling triggered by overflow of R1X000
hardware counters
-gi_hwc Graduated instructions
-gfp_hwc Floating point instructions
-ic_hwc Misses in L1 I-cache
-dc_hwc Misses in L1 D-cache
-dsc_hwc Data misses in L2 cache
-tlb_hwc TLB misses
-prof_hwc User selected event
User selected sampling
• Select a hardware counter, say
secondary cache misses (26), and an
overflow value
setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 26
setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 99
• Run the experiment
ssrun -prof_hwc a.out
• Default counter is L1 I-cache misses
(9) and default overflow is 2,053
Ssrun ideal and tracing
experiments
• Ideal Experiment: basic block counting
-ideal counts the number of times
each basic block is executed and
estimates the time. Descendant of pixie
• Tracing
-fpe floating point exceptions
-io file open, read,write, close
-heap malloc and free
Prof Tool
• Display event counts or time in routines
sorted in descending order of the counts
• Source line granularity with command line
option -h or -l
• For ideal and usertime experiments get
call hierarchy with -butterfly option
• For ideal experiment can get architecture
information with the -archinfo option
• Cut off report at top 100-p% with -quit p%
Address Space Profiling: dprof
• Gives per process histograms of page
accesses
• Sampling with a specified time base
– the current instruction is interrupted
– the address of the operand referenced by the
interrupted instruction is recorded
• Time base is either the interval timer or
an R1X000 hardware counter overflow
• R1X000 counters: man r10k_counters
Data Profiling: dprof
• Syntax
dprof [-hwpc [-cntr n] [-ovfl m]]
[-itimer [-ms t]] [-out profile_file]
a.out
• Default is interval timer ( -itimer )
with t=100 ms
• Can select hardware counter (-hwpc)
which has the defaults
n = 0 is the R1X000 cycle counter
m=10000 is the counter’s overflow value
The Future of OpenMP
• Data placement directives will
become part of OpenMP
– affinity scheduling may be a useful
feature
• It is desirable to add parallel
input/output to OpenMP
• Java binding of OpenMP
Image class
class Image {
public:
short* mData;
int mWidth, mHeight, mDepth;
int mVoxelsPerSlice;
int mVoxelsPerVolume;
short** mSlicePointers; // Pointers to the start of each slice
short getVoxel ( int x, int y, int z ) {...}
void setVoxel ( int x, int y, int z, short v ) {...}
};
Threshold – OpenMP #1
void doThreshold ( Image* in, Image* out ) {
#pragma omp parallel for
for ( int z = 0; z < in->mDepth; z++ ) {
for ( int y = 0; y < in->mHeight; y++ ) {
for ( int x = 0; x < in->mWidth; x++ ) {
if ( in->getVoxel(x,y,z) > 100 ) {
out->setVoxel(x,y,z,1);
} else {
out->setVoxel(x,y,z,0);
}
}
}
}
}
// NB: can loop over slices, rows or columns by moving
// pragma, but must choose at compile time
Threshold – OpenMP #2
void doThreshold ( Image* in, Image* out ) {
#pragma omp parallel for
for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
if ( in->mData[s] > 100 ) {
out->mData[s] = 1;
} else {
out->mData[s] = 0;
}
}
}
// Likely a lot faster than previous code
References
Lawrence Livermore National Laboratory
www.llnl.gov/computing/tutorials/workshops/workshop/openMP/MAIN.html
Ohio Supercomputing Center
oscinfo.osc.edu/training/openmp/big
Minnesota Supercomputing Institute
www.msi.umn.edu/tutorials/shared_tutorials/openMP
Edinburgh OpenMP Microbenchmarks
www.epcc.ed.ac.uk/research/openmpbench
Mattson and Eigenmann Tutorial
dynamo.ecn.purdue.edu/~eigenman/EE563/Handouts/BWsc00introOMP.pdf
Mattson and Eigenmann Advanced OpenMP
dynamo.ecn.purdue.edu/~eigenman/EE563/Handouts/BWsc00advancedOMP.pdf
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABSDESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
Swimming pool mechanical components design.pptx
Swimming pool  mechanical components design.pptxSwimming pool  mechanical components design.pptx
Swimming pool mechanical components design.pptx
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 

Nbvtalkataitamimageprocessingconf

  • 1. Writing OpenMP Programs on Many and Multi Core Machines Prof NB Venkateswarlu ISTE Visiting Professor 2010-11 CSE, AITAM, Tekkali venkat_ritch@yahoo.com www.ritchcenter.com/nbv
  • 2. Agenda • Why OpenMP ? • Elements of OpenMP • Scalable Speedup and Data Locality • Parallelizing Sequential Programs • Breaking data dependencies • Avoiding synchronization overheads • Achieving Cache and Page Locality • SGI Tools for Performance Analysis and Tuning
  • 3. Why OpenMP ? • Parallel programming is more difficult than sequential programming • OpenMP is a scalable, portable, incremental approach to designing shared memory parallel programs • OpenMP supports – fine and coarse grained parallelism – data and control parallelism
  • 4. What is OpenMP ? Three components: • Set of compiler directives for – creating teams of threads – sharing the work among threads – synchronizing the threads • Library routines for setting and querying thread attributes • Environment variables for controlling run- time behavior of the parallel program
  • 5. Elements of OpenMP • Parallel regions and work sharing • Data scoping • Synchronization • Compiling and running OpenMP programs
  • 6. Parallelism in OpenMP • The parallel region is the construct for creating multiple threads in an OpenMP program • A team of threads is created at run time for a parallel region • A nested parallel region is allowed, but may contain a team of one thread • Nested parallelism is enabled with setenv OMP_NESTED TRUE
  • 7. Parallelism in OpenMP beginning of parallel region fork join fork join end of parallel region
  • 8. Hello World in OpenMP #include <omp.h> int main() { int iam =0, np = 1; #pragma omp parallel private(iam, np) { #if defined (_OPENMP) np = omp_get_num_threads(); iam = omp_get_thread_num(); #endif printf(“Hello from thread %d out of %d n”, iam, np); } } parallel region directive with data scoping clause
  • 9. Specifying Parallel Regions • Fortran ! $OMP PARALLEL [clause [clause…]] ! Block of code executed by all threads !$OMP END PARALLEL • C and C++ #pragma omp parallel [clause [clause...]] { /* Block executed by all threads */
  • 10.
  • 11. Work sharing in OpenMP • Two ways to specify parallel work: – Explicitly coded in parallel regions – Work-sharing constructs » DO and for constructs: parallel loops » sections » single • SPMD type of parallelism supported
  • 12. Work and Data Partitioning Loop parallelization • distribute the work among the threads, without explicitly distributing the data. • scheduling determines which thread accesses which data • communication between threads is implicit, through data sharing • synchronization via parallel constructs or is explicitly inserted in the code
  • 13. Data Partitioning & SPMD • Data is distributed explicitly among processes • With message passing, e.g., MPI, where no data is shared, data is explicitly communicated • Synchronization is explicit or embedded in communication • With parallel regions in OpenMP, both SPMD and data sharing are supported
  • 14. Pros and Cons of SPMD » Potentially higher parallel fraction than with loop parallelism » The fewer parallel regions, the less overhead » More explicit synchronization needed than for loop parallelization » Does not promote incremental parallelization and requires manually assigning data subsets to threads
  • 15. SPMD Example program mat_init implicit none integer, parameter::N=1024 real A(N,N) integer :: iam, np iam = 0 np = 1 !$omp parallel private(iam,np) np = omp_get_num_threads() iam = omp_get_thread_num() ! Each thread calls work call work(N, A, iam, np) !$omp end parallel end subroutine work(n, A, iam, np) integer n, iam, n real A(n,n) integer :: chunk,low,high,i,j chunk = (n + np - 1)/np low = 1 + iam*chunk high=min(n,(iam+1)*chunk) do j = low, high do I=1,n A(I,j)=3.14 + & sqrt(real(i*i*i+j*j+i*j*j)) enddo enddo return A single parallel region, no scheduling needed, each thread explicitly determines its work
  • 16. Extent of directives Most directives have as extent a structured block, or basic block, i.e., a sequence of statements with a flow of control that satisfies: • there is only one entry point in the block, at the beginning of the block • there is only one exit point, at the end of the block; the exceptions are that exit() in C and stop in Fortran are allowed
  • 17. Work Sharing Constructs • DO and for: parallelize a loop, dividing the iterations among the threads • sections: defines a sequence of contiguous blocks, the beginning of each block being marked by a section directive; the block within each section is assigned to one thread • single: assigns a block of a parallel region to a single thread
  • 18. Specialized Parallel Regions Work-sharing can be specified combined with a parallel region • parallel DO and parallel for : a parallel region which contains a parallel loop • parallel sections, a parallel region that contains a number of section constructs
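Not from the original slides: a minimal C sketch of the two combined constructs just described; the arrays a and b, the loop bounds, and the section bodies are made up for illustration.
#include <stdio.h>
#include <omp.h>

#define N 1024

int main(void) {
    double a[N], b[N];
    int i;

    /* Combined parallel region + loop work-sharing */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2.0 * i;

    /* Combined parallel region + sections: each section is run by one thread */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int j = 0; j < N; j++) b[j] = a[j] + 1.0; }
        #pragma omp section
        { printf("section executed by thread %d\n", omp_get_thread_num()); }
    }

    printf("a[10]=%f b[10]=%f\n", a[10], b[10]);
    return 0;
}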
  • 19. Scheduling • Scheduling assigns the iterations of a parallel loop to the team threads • The directives [parallel] do and [parallel] for take the clause schedule(type [,chunk]) • The optional chunk is a loop-invariant positive integer specifying the number of contiguous iterations assigned to a thread
  • 20. Scheduling The type can be one of • static threads are statically assigned chunks of size chunk in a round-robin fashion. The default for chunk is ceiling(N/p) where N is the number of iterations and p is the number of processors • dynamic threads are dynamically assigned chunks of size chunk, i.e.,
  • 21. Scheduling when a thread is ready to receive new work, it is assigned the next pending chunk. Default value for chunk is 1. • guided a variant of dynamic scheduling in which the size of the chunk decreases exponentially from chunk to 1. Default value for chunk is ceiling(N/p)
  • 22. Scheduling • runtime indicates that the schedule type and chunk are specified by the environment variable OMP_SCHEDULE. A chunk cannot be specified with runtime. • Example of run-time specified scheduling setenv OMP_SCHEDULE “dynamic,2”
  • 23. Scheduling • If the schedule clause is missing, an implementation dependent schedule is selected. MIPSpro selects by default the static schedule • Static scheduling has low overhead and provides better data locality • Dynamic and guided scheduling may provide better load balancing
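As a hedged illustration of the schedule clause discussed on slides 19-23 (the loop bodies and chunk sizes are arbitrary choices, not values from the slides):
#include <math.h>

#define N 100000

void schedule_examples(double *a) {
    int i;

    /* static: contiguous chunks of 100 iterations, assigned round-robin */
    #pragma omp parallel for schedule(static, 100)
    for (i = 0; i < N; i++) a[i] = sqrt((double)i);

    /* dynamic: an idle thread grabs the next pending chunk of 64 iterations */
    #pragma omp parallel for schedule(dynamic, 64)
    for (i = 0; i < N; i++) a[i] += 1.0;

    /* guided: chunk size decreases as the loop progresses */
    #pragma omp parallel for schedule(guided)
    for (i = 0; i < N; i++) a[i] *= 2.0;

    /* runtime: type and chunk are taken from OMP_SCHEDULE at run time */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++) a[i] -= 1.0;
}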
  • 24. Work Sharing Constructs A motivating example
Sequential code:
for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }
OpenMP Parallel Region (SPMD style):
#pragma omp parallel
{
  int id, i, Nthrds, istart, iend;
  id = omp_get_thread_num();
  Nthrds = omp_get_num_threads();
  istart = id * N / Nthrds;
  iend = (id+1) * N / Nthrds;
  for(i=istart;i<iend;i++) { a[i] = a[i] + b[i]; }
}
OpenMP Parallel Region and a work-sharing for construct:
#pragma omp parallel
#pragma omp for schedule(static)
for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }
  • 25. Work-sharing Construct • Threads are assigned an independent set of iterations • Threads must wait at the implicit barrier at the end of the work-sharing construct (diagram: iterations i = 1 … 12 divided among the threads, with the implicit barrier at the end) #pragma omp parallel #pragma omp for for(i = 1; i < 13; i++) c[i] = a[i] + b[i];
  • 26. Combining pragmas • These two code segments are equivalent #pragma omp parallel { #pragma omp for for (i=0;i< MAX; i++) { res[i] = huge(); } } #pragma omp parallel for for (i=0;i< MAX; i++) { res[i] = huge(); }
  • 27. Types of Extents Two types for the extent of a directive: • static or lexical extent: the code textually enclosed between the beginning and the end of the structured block following the directive • dynamic extent: static extent as well as the procedures called from within the static extent
  • 28. Orphaned Directives A directive which is in the dynamic extent of another directive but not in its static extent is said to be orphaned • Work sharing directives can be orphaned • This allows a work-sharing construct to occur in a subroutine which can be called both by serial and parallel code, improving modularity
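A small sketch of an orphaned work-sharing directive, with an invented function do_work and array x: the omp for binds to whatever parallel region is active in the caller, and simply runs serially when there is none.
#include <stdio.h>
#include <omp.h>

#define N 1000
static double x[N];

/* The "omp for" below is orphaned: it lies in the dynamic extent of the
   caller's parallel region, but not in its static (lexical) extent. */
void do_work(void) {
    int i;
    #pragma omp for
    for (i = 0; i < N; i++)
        x[i] = i * 0.5;
}

int main(void) {
    do_work();              /* serial call: the loop runs on one thread  */

    #pragma omp parallel    /* parallel call: iterations are shared      */
    do_work();

    printf("x[N-1] = %f\n", x[N - 1]);
    return 0;
}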
  • 29. Directive Binding • Work sharing directives (do, for, sections, and single) as well as master and barrier bind to the dynamically closest parallel directive, if one exists, and have no effect when they are not in the dynamic extent of a parallel region • The ordered directive binds to the enclosing do or for directive having the ordered clause
  • 30. Directive Binding • critical (and atomic) provide mutual exclusive execution (and update) with respect to all the threads in the program
  • 31. Data Scoping • Work-sharing and parallel directives accept data scoping clauses • Scope clauses apply to the static extent of the directive and to variables passed as actual arguments • The shared clause applied to a variable means that all threads will access the single copy of that variable created in the master thread
  • 32. Data Scoping • The private clause applied to a variable means that a separate, uninitialized copy of the variable is created for each thread • Semi-private data for parallel loops: – reduction: variable that is the target of a reduction operation performed by the loop, e.g., sum – firstprivate: initialize the private copy from the value of the shared variable – lastprivate: upon loop exit, the master thread holds the value seen by the thread assigned the last loop iteration
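A sketch combining the firstprivate, lastprivate, and reduction clauses in one loop; the variable names and values are illustrative only.
#include <stdio.h>

#define N 100

int main(void) {
    double a[N], sum = 0.0, scale = 2.0, last = 0.0;
    int i;

    #pragma omp parallel for firstprivate(scale) lastprivate(last) reduction(+:sum)
    for (i = 0; i < N; i++) {
        double t = scale * i;   /* t is private: declared inside the loop    */
        a[i] = t;               /* scale starts from the shared value 2.0    */
        sum += t;               /* partial sums are combined on loop exit    */
        last = t;               /* after the loop, last holds t for i == N-1 */
    }

    printf("sum = %f, last = %f\n", sum, last);
    return 0;
}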
  • 33. Threadprivate Data • The threadprivate directive is associated with the declaration of a static variable (C) or common block (Fortran) and specifies persistent data (spans parallel regions) cloned, but not initialized, for each thread • To guarantee persistence, the dynamic threads feature must be disabled setenv OMP_DYNAMIC FALSE
  • 34. Threadprivate Data • threadprivate data can be initialized in a thread using the copyin clause associated with the parallel, parallel do/for, and parallel sections directives • the value stored in the master thread is copied into each team thread • Syntax: copyin (name [,name]) where name is a variable or (in Fortran) a named common block
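A C sketch of threadprivate data initialized with copyin, roughly mirroring the description above; the counter variable is hypothetical.
#include <stdio.h>
#include <omp.h>

int counter = 10;                   /* file-scope (static storage) variable      */
#pragma omp threadprivate(counter)  /* each thread gets its own persistent copy  */

int main(void) {
    omp_set_dynamic(0);   /* disable dynamic threads so the copies persist */

    counter = 42;         /* master thread's copy */

    /* copyin initializes every team thread's copy from the master's value */
    #pragma omp parallel copyin(counter)
    {
        counter += omp_get_thread_num();
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }

    /* In a later parallel region each thread still sees its own counter,
       provided the thread team has not changed. */
    #pragma omp parallel
    printf("thread %d again: counter = %d\n", omp_get_thread_num(), counter);

    return 0;
}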
  • 35. Scoping Rules • Data declared outside a parallel region is shared by default, except for – loop index variable of parallel do – data declared as threadprivate • Local data in the dynamic extent of a parallel region is private: – subroutine local variables, and – C/C++ blocks within a parallel region
  • 36. Scoping Restrictions • The private clause for a directive in the dynamic extent of a parallel region can be specified only for variables that are shared in the enclosing parallel region – That is, a privatized variable cannot be privatized again • The shared clause is not allowed for the DO (Fortran) or for (C) directive
  • 37. Shared Data • Access to shared data must be mutually exclusive: a thread at a time • For shared arrays, when different threads access mutually exclusive subscripts, synchronization is not needed • For shared scalars, critical sections or atomic updates must be used • Consistency operation: flush directive
  • 38. Synchronization Explicit, via directives: • critical, implements the critical sections, providing mutual exclusion • atomic, implements atomic update of a shared variable • barrier, a thread waits at the point where the directive is placed until all other threads reach the barrier
  • 39. Synchronization • ordered, preserves the order of the sequential execution; can occur at most once inside a parallel loop • flush, creates consistent view of thread-visible data • master, block in a parallel region that is executed by the master thread and skipped by the other threads; unlike single, there is no implied barrier
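A short sketch exercising the explicit synchronization directives listed above; the shared counters are invented for illustration.
#include <stdio.h>
#include <omp.h>

int main(void) {
    int hits = 0;
    double total = 0.0;

    #pragma omp parallel
    {
        #pragma omp atomic          /* atomic update of a shared scalar          */
        hits += 1;

        #pragma omp critical        /* mutual exclusion for a larger update      */
        {
            total += omp_get_thread_num() * 0.5;
            printf("thread %d in critical section\n", omp_get_thread_num());
        }

        #pragma omp barrier         /* wait until every thread reaches this point */

        #pragma omp master          /* master thread only; no implied barrier     */
        printf("hits = %d, total = %f\n", hits, total);
    }
    return 0;
}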
  • 40. Implicit Synchronization • There is an implied barrier at the end of a parallel region, and of a work- sharing construct for which a nowait clause is not specified • A flush is implied by an explicit or implicit barrier as well as upon entry and exit of a critical or ordered block
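The implied barrier after a work-sharing construct can be dropped with nowait when consecutive loops are independent; a minimal sketch with assumed arrays a and b:
#define N 10000

void independent_loops(double *a, double *b) {
    int i;
    #pragma omp parallel
    {
        /* nowait: a thread finishing this loop moves straight to the next one */
        #pragma omp for nowait
        for (i = 0; i < N; i++) a[i] = i * 1.5;

        #pragma omp for          /* implied barrier at the end of this loop */
        for (i = 0; i < N; i++) b[i] = i * 2.5;
    }   /* implied barrier at the end of the parallel region */
}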
  • 41. Directive Nesting • A parallel directive can appear in the dynamic extent of another parallel, i.e., parallel regions can be nested • Work-sharing directives binding to the same parallel directive cannot be nested • An ordered directive cannot appear in the dynamic extent of a critical directive
  • 42. Directive Nesting • A barrier or master directive cannot appear in the dynamic extent of a work-sharing region ( DO or for, sections, and single) or ordered block • In addition, a barrier directive cannot appear in the dynamic extent of a critical or master block
  • 43. Environment Variables
Name – Value
OMP_NUM_THREADS – positive number
OMP_DYNAMIC – TRUE or FALSE
OMP_NESTED – TRUE or FALSE
OMP_SCHEDULE – e.g., “static,2”
• Online help: man openmp
  • 44. Library Routines OpenMP defines library routines that can be divided in three categories 1. Query and set multithreading • get/set number of threads or processors omp_set_num_threads, omp_get_num_threads, omp_in_parallel, … • get thread ID: omp_get_thread_num
  • 45. Library Routines 2. Set and get execution environment • Inquire/set nested parallelism: omp_get_nested omp_set_nested • Inquire/set dynamic number of threads in different parallel regions: omp_set_dynamic omp_get_dynamic
  • 46. Library Routines 3. API for manipulating locks • A lock variable provides thread synchronization, has C type omp_lock_t and Fortran type integer*8, and holds a 64-bit address • Locking routines: omp_init_lock, omp_set_lock,omp_unset_lock... Man pages: omp_threads, omp_lock
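A hedged sketch of the lock routines named above, protecting an invented shared counter:
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lock;
    int shared_count = 0;

    omp_init_lock(&lock);           /* create the lock */

    #pragma omp parallel
    {
        omp_set_lock(&lock);        /* acquire: other threads wait here */
        shared_count++;
        printf("thread %d incremented count to %d\n",
               omp_get_thread_num(), shared_count);
        omp_unset_lock(&lock);      /* release */
    }

    omp_destroy_lock(&lock);        /* free the lock's resources */
    printf("final count = %d\n", shared_count);
    return 0;
}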
  • 47. Reality Check Irregular and ambiguous aspects are sources of language- and implementation-dependent behavior: • nowait clause is allowed at the beginning of [parallel] for (C/C++) but at the end of [parallel] DO (Fortran) • default clause can specify private scope in Fortran, but not in C/C++
  • 48. Reality Check • Can only privatize full objects, not array elements, or fields of data structures • For a threadprivate variable or block one cannot specify any clause except for the copyin clause • In MIPSpro 7.3.1 one cannot specify in the same directive both the firstprivate and lastprivate clauses for a variable
  • 49. Reality Check • With MIPSpro 7.3.1, when a loop is parallelized with the do (Fortran) or for (C/C++) directive, the indexes of the nested loops are, by default, private in Fortran, but shared in C/C++; this is probably a compiler issue • Fortunately, the compiler warns about unsynchronized accesses to shared variables • This does not occur for parallel do or parallel for
  • 50. Compiling and Running • Use MIPSpro with the option -mp both for compiling and linking; the default -MP:open_mp=ON must be in effect • Fortran: f90 [-freeform] [-cpp] -mp prog.f (-freeform needed for free-form source, -cpp needed when using #ifdefs) • C/C++: cc -mp -O3 prog.c or CC -mp -O3 prog.C
  • 51. Setting the Number of Threads • Environment variables: setenv OMP_NUM_THREADS 8 if OMP_NUM_THREADS is not set, but MP_SET_NUMTHREADS is set, the latter defines the number of threads • Environment variables can be overridden by the programmer: omp_set_num_threads(int n)
  • 52. Scalable Speedup • Most often the memory is the limit to the performance of a shared memory program • On scalable architectures, the latency and bandwidth of memory accesses depend on the locality of accesses • In achieving good speedup of a shared memory program, data locality is an essential element
  • 53. What Determines Data Locality • Initial data distribution determines on which node the memory is placed – first touch or round-robin system policies – data distribution directives – explicit page placement • Work sharing, e.g., loop scheduling, determines which thread accesses which data • Cache friendliness determines how often main memory is accessed
  • 54. Cache Friendliness For both serial loops and parallel loops • locality of references – spatial locality: use adjacent cache lines and all items in a cache line – temporal locality: reuse same cache line; may employ techniques such as cache blocking • low cache contention – avoid the sharing of cache lines among different objects; may resort to array padding or increasing the rank of an array
  • 55. Cache Friendliness • Contention is an issue specific to parallel loops, e.g., false sharing of cache lines cache friendliness = high locality of references + low contention
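A sketch of the false sharing mentioned above and the usual padding workaround; the 128-byte pad is an assumed cache-line size, not a figure from the slides.
#include <stdio.h>
#include <omp.h>

#define MAXTHREADS 64       /* assumes at most MAXTHREADS threads in the team */
#define PAD 16              /* 16 doubles = 128 bytes, assumed >= one cache line */

int main(void) {
    /* Contended layout: partial[0], partial[1], ... share cache lines, so
       every update invalidates the line in the other threads' caches. */
    static double partial[MAXTHREADS];

    /* Padded layout: each thread's slot sits on its own cache line. */
    static double padded[MAXTHREADS][PAD];

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (int k = 0; k < 1000000; k++) {
            partial[id] += 1.0;       /* false sharing: scales poorly      */
            padded[id][0] += 1.0;     /* no sharing: updates stay local    */
        }
    }
    printf("partial[0]=%f padded[0][0]=%f\n", partial[0], padded[0][0]);
    return 0;
}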
  • 56. NUMA machines • Memory hierarchies exist in single-CPU computers and Symmetric Multiprocessors (SMPs) • Distributed shared memory (DSM) machines based on Non-Uniform Memory Architecture (NUMA) add levels to the hierarchy: – local memory has low latency – remote memory has high latency
  • 57. Origin2000 memory hierarchy (latency in cycles)
register – 0
primary cache – 2..3
secondary cache – 8..10
local main memory & TLB hit – 75
remote main memory & TLB hit – 250
main memory & TLB miss – 2000
page fault – 10^6
  • 58. Page Level Locality • An ideal application has full page locality: pages accessed by a processor are on the same node as the processor, and no page is accessed by more than one processor (no page sharing) • Twofold benefit: » low memory latency » scalability of memory bandwidth
  • 59. Page Level Locality • The benefits brought about by page locality are more important for programs that are not cache friendly • We look at several data placement strategies for improving page locality » system based placement » data initialization and directives » combination of system and program directed data placement
  • 60. Page sharing due to alignment (diagram: an array laid out across pages 1, 2, and 3; the first section is accessed by processor 1, the second by processor 2) • Consider an array whose size is twice the size of a page, and which is distributed between two nodes • Page 1 and page 2 are located on node 1, page 3 is on node 2 • Page 2 is shared by the two processors, because the array does not start on a page boundary
  • 61. Achieving Page Locality IRIX has two page placement policies: • first-touch: the process which first references a virtual address causes that address to be mapped to a page on the node where the process runs • round-robin: pages allocated to a job are selected from nodes traversed in round- robin order • IRIX uses first-touch, unless setenv _DSM_ROUND_ROBIN
  • 62. Achieving Page Locality IRIX can migrate pages between nodes to adjust the page placement • a page is migrated based on the affinity of data accesses to that page, which is derived at run time from the per-process cache-miss pattern • page migration follows the page affinity with a delay whose magnitude depends on the aggressiveness of migration
  • 63. Achieving Page Locality • To enable data migration, except for explicitly placed data setenv _DSM_MIGRATION ON • To enable migration of all data setenv _DSM_MIGRATION ALL_ON • To set the aggressiveness of migration setenv _DSM_MIGRATION_LEVEL n where n is an integer between 0 (least aggressive, disables migration) and 100 (most aggressive, the default)
  • 64. Achieving Page Locality Methods, from best to worst • Parallel data initialization, using OpenMP parallel work constructs such as parallel do, combined with operating system’s first-touch placement policy » works with heap, local, global arrays » no data distribution directives needed » can be used with page migration
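A sketch of parallel data initialization under a first-touch placement policy: the initialization and compute loops use the same static schedule, so each thread first touches (and thereby places locally) the pages it later computes on. The array and its size are illustrative.
#include <stdlib.h>
#include <math.h>

#define N (1 << 22)

int main(void) {
    double *a = malloc(N * sizeof(double));
    int i;

    /* Parallel initialization: with first-touch, each page is placed on
       the node of the thread that touches it first. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) a[i] = 0.0;

    /* Compute loop with the same static schedule: each thread mostly
       accesses pages that are local to its node. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) a[i] = sqrt((double)i) + 1.0;

    free(a);
    return 0;
}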
  • 65. Achieving Page Locality • IRIX round-robin page placement » improves application’s memory bandwidth » no change of code needed » allows both serial and parallel initialization of data in the program
  • 66. Achieving Page Locality • Regular distribution directive » allows serial initialization of data » data has same layout as in a serial program » page granularity of distribution » cannot distribute heap allocated and assumed-size arrays
  • 67. Achieving Page Locality • Page Migration: » makes initial data placement less important, e.g., allows sequential data initialization » improves locality of a computation whose data access pattern changes during the computations » it is useful for programs that have stable affinity for long time intervals
  • 68. Achieving Page Locality » page migration can be combined with other techniques such as first- touch or round-robin » page migration is expensive » page migration implements CPU affinity with a delay
  • 69. Reshaped Distribution • Reshaped distribution directive » no page granularity limitation » data layout is most likely different from the layout in a serial program » code bloating: each routine that is passed a reshaped parameter must have a version specialized for handling reshaped arrays – layout is different for reshaped arrays
  • 70. » cannot reshape initialized data, heap allocated and assumed-size arrays » overhead of indirect addressing » side effect: a global structure or Fortran common block that contains a reshaped array cannot be declared threadprivate, and cannot be localized with the -Wl,-Xlocal option Reshaped Distribution
  • 71. • Regular distribution !$SGI distribute a(d1[,d2]) [onto(p1[,p2])] #pragma distribute a[d1][[d2]] [onto(p1[,p2])] • Reshaped distribution !$SGI distribute_reshape a(d1[,d2]) [onto(p1[,p2])] #pragma distribute_reshape a[d1][[d2]] • Distribution methods are denoted by d1, d2 • Optional clause onto specifies a processor grid n1 x n2 , such that n1/n2 = p1/p2 SGI Data Placement Directives
  • 72. Three distribution methods * means no distribution along the direction in which it appears block distributes the elements of an array in p contiguous chunks of size ceiling(N/p), where N is the extent in the distributed direction and p is the number of processors Distribution Specification
  • 73. cyclic(k) distributes the elements of an array in chunks of size k in a round-robin fashion, i.e., – the first processor is assigned array elements 1, k+1, 2*k+1, ... (in Fortran) or 0, k, 2*k, ... (in C/C++) – the second processor is assigned elements 2, k+2, ... (in Fortran) or 1, k+1, 2*k+1, ... (in C/C++) Interleaved distribution is obtained for k=1 (the default) and block-cyclic distribution for k>1 Distribution Specification
  • 74. For regular distribution, one should distribute the outer dimension of an array, to minimize the effect of page granularity: • Distribute columns in Fortran !$SGI distribute A(*, block) • Distribute rows in C/C++ #pragma distribute a(block,*) Regular Distribution Tip
  • 75. • Assumed size is not allowed for array formal parameters which are declared as distributed • Specify array size: void foo(int n, double a[n]) { #pragma distribute_reshape a(block) … } Distributed Arrays as Formal Parameters
  • 76. If a reshaped array is declared as threadprivate, the compiler will silently ignore the threadprivate directive: double a[n]; #pragma omp threadprivate(a) #pragma distribute_reshape a(block) Reshaped Array Pitfall
  • 77. Parallelizing Code • Optimize single-CPU performance – maximize cache reuse – eliminate cache misses – compiler flags: -LNO:cache_size2=4m -OPT:IEEE_arithmetic=3 -Ofast=ip27 • Parallelize as high a fraction of the work as possible – preserve cache friendliness
  • 78. Parallelizing Code – avoid synchronization and scheduling overhead: partition in few parallel regions, avoid reduction, single and critical sections, make the code loop fusion friendly, use static scheduling – partition work to achieve load balancing • Check correctness of parallel code – run OpenMP compiled code first on one thread, then on several threads
  • 79. Synchronization Overhead • Parallel regions, work-sharing, and synchronization incur overhead • The Edinburgh OpenMP Microbenchmarks, version 1.0, by J. Mark Bull, are used to measure the cost of synchronization on a 32-processor Origin 2000 with 300 MHz R12000 processors; the benchmarks are compiled with the MIPSpro Fortran 90 compiler, version 7.3.1.1m
  • 82. Insights • cost (DO) ~ cost(barrier) • cost (parallel DO) ~ 2 * cost(barrier) • cost (parallel) > cost (parallel DO) • atomic is less expensive than critical • bad scalability for – reduction – mutual exclusion: critical, (un)lock – single
  • 83. Loop Parallelization • Identify the loops that are bottleneck to performance • Parallelize the loops, and ensure that – no data races are created – cache friendliness is preserved – page locality is achieved – synchronization and scheduling overheads are minimized
  • 84. Hurdles to Loop Parallelization • Data dependencies among iterations caused by shared variables • Input/Output operations inside the loop • Calls to thread-unsafe code, e.g., the intrinsic function rtc • Branches out of the loop • Insufficient work in the loop body • The MIPSpro auto-parallelizer helps in identifying these hurdles
  • 85. Auto-Parallelizer • The MIPSpro auto-parallelizer (APO) can be used both for automatically parallelizing loops and for determining the reasons which prevent a loop from being parallelized • The auto-parallelizer is activated using command line option apo to the f90, f77, cc, and CC compilers • Other auto-parallelizer options: apo list and mplist
  • 86. Auto-Parallelizer • Example: f90 -apo list -mplist myprog.f • apo list enables APO and generates the file myprog.list which describes which loops have been parallelized, which have not, and why not • mplist generates the parallelized source program myprog.w2f.f (myprog.w2c.c for C) equivalent to the original code myprog.f
  • 87. For More Information About the Auto-Parallelizer • The ProMP Parallel Analyzer View product consists of the program cvpav • cvpav analyzes files created by compiling with the option -apo keep • Try out the tutorial examples: cd /usr/demos/ProMP/omp_tutorial make cvpav -f omp_demo.f
  • 88. Data Races • Parallelizing a loop with data dependencies causes data races: unordered or interfering accesses by multiple threads to shared variables, which make the values of these variables different from the values assumed in a serial execution • A program with data races produces unpredictable results, which depend on thread scheduling and speed.
  • 89. Types of Data Dependencies • Reduction operations: const int n = 4096; int a[n], i, sum=0; for (i = 0; i < n; i++) { sum += a[i]; } – Easy to parallelize using reduction variables
  • 90. Types of Data Dependencies – Auto-parallelizer is able to detect reduction and parallelize it const int n = 4096; int a[n], i, sum = 0; #pragma omp parallel for reduction(+:sum) for (i = 0; i < n; i++) { sum += a[i]; }
  • 91. Types of Data Dependencies • Carried dependence on a shared array, e.g., recurrence: const int n = 4096; int a[n], i; for (i = 0; i < n-1; i++) { a[i] = a[i+1]; } – Non-trivial to eliminate, the auto- parallelizer cannot do it
  • 92. Parallelizing the Recurrence
Idea: Segregate even and odd indices
#define N 16384
int a[N], work[N+1];
int i;
// Save border element
work[N] = a[0];
// Save & shift even indices
#pragma omp parallel for
for ( i = 2; i < N; i+=2 ) { work[i-1] = a[i]; }
// Update even indices from odd
#pragma omp parallel for
for ( i = 0; i < N-1; i+=2 ) { a[i] = a[i+1]; }
// Update odd indices with even
#pragma omp parallel for
for ( i = 1; i < N-1; i+=2 ) { a[i] = work[i]; }
// Set border element
a[N-1] = work[N];
  • 93. Performing Reduction The bad scalability of the reduction clause affects its usefulness, e.g., bad speedup when summing the elements of a matrix: #define N 1<<12 #define M 16 int i, j; double a[N][M], sum = 0.0; #pragma omp parallel for reduction(+:sum) for (i = 0; i < N; i++) for (j = 0; j < M; j++) sum += a[i][j];
  • 94. Parallelizing the Sum
Idea: Use explicit partial sums and combine them atomically
#define N 1<<12
#define M 16
int main() {
  double a[N][M], sum = 0.0;
  #pragma distribute a[block][*]
  int i, j = 0;
  // initialization of a not shown
  #pragma omp parallel private(i,j)
  {
    double mysum = 0.0;
    // compute partial sum
    #pragma omp for nowait
    for (i = 0; i < N; i++)
      for (j = 0; j < M; j++)
        mysum += a[i][j];
    // each thread adds its partial sum
    #pragma omp atomic
    sum += mysum;
  }
}
  • 95. Sum and Product Speedup
  • 96. Loop Fusion • Increases the work in the loop body • Better serial programs: fusion promotes software pipelining and reduces the frequency of branches • Better OpenMP programs: fusion reduces synchronization and scheduling overhead – fewer parallel regions and work- sharing constructs
  • 97. Promoting Loop Fusion • Loop fusion inhibited by statements between loops which may have dependencies with data accessed by the loops • Promote fusion: reorder the code to get loops which are not separated by statements creating data dependencies • Use one parallel do construct for several adjacent loops; may leave it to the compiler to actually perform fusion
  • 98. Fusion-friendly code
Unfriendly:
integer,parameter::n=4096
real :: sum, a(n)
do i=1,n
  a(i) = sqrt(dble(i*i+1))
enddo
sum = 0.d0
do i=1,n
  sum = sum + a(i)
enddo
Friendly:
integer,parameter::n=4096
real :: sum, a(n)
sum = 0.d0
do i=1,n
  a(i) = sqrt(dble(i*i+1))
enddo
do i=1,n
  sum = sum + a(i)
enddo
  • 99. Tradeoffs in Parallelization • To increase parallel fraction of work when parallelizing loops, it is best to parallelize the outermost loop of a nested loop • However, doing so may require loop transformations such as loop interchanges, which can destroy cache friendliness, e.g., defeat cache blocking
  • 100. Tradeoffs in Parallelization • Static loop scheduling in large chunks per thread promotes cache and page locality but may not achieve load balancing • Dynamic and interleaved scheduling achieve good load balancing but cause poor locality of data references
  • 101. Tuning the Parallel Code • Examine resource usage, e.g., execution time, number of floating point operations, primary, secondary, and TLB cache misses and identify – the performance bottleneck – the routines generating the bottleneck Useful SGI tools: perfex, ssrun, prof • Correct the performance problem and verify the desired speedup.
  • 102. • In C, use SGI function syssgi #include <sys/syssgi.h> ptrdiff_t syssgi (int request, …) with a request value of SGI_PHYSP • In Fortran, use the intrinsic function integer dsm_home_threadnum thread=dsm_home_threadnum(arr(i)) - see lab exercise Investigating Data Placement
  • 103. SGI Performance Tools • MIPSpro compilers and libraries • ProDev workshop (formerly CASE Vision) : cvd, cvperf, cvstatic • ProMP: parallel analyzer: cvpav • SpeedShop: profiling execution and reporting profile data • Perfex: per process event count statistics • dprof: address space profiling
  • 104. Performance Data • Timing: time spent in various sections of the program • Events captured by the performance counters of the R1X000 CPU. - 32 events, divided in two equal sets - Examples: clock cycles, L1 and L2 cache, and TLB misses, floating point operations, number of instructions • I/O system calls, heap malloc and free, floating point exceptions
  • 105. Speedshop Profiling • Speedshop is tool package which supports profiling at the function and source line level • Uses several methods for collecting information – PC and call stack sampling – basic block counting – exception tracing
  • 106. Perfex Tool • Provides event statistics at the process level • Reports the number of occurrences of the events captured by the R1X000 hardware counters in each process of a parallel program • In addition, reports information derived from the event counts, e.g. MFLOPS, memory bandwidth
  • 107. Perfex Tool perfex -mp [other options] a.out • To profile secondary cache misses in the data cache (event 26) and instruction cache (event 10): perfex -mp -e 26 -e 10 a.out • To multiplex all 32 events (-a) , get time estimates (-y) and trace exceptions (-x) perfex -a -x -y a.out
  • 108. Speedshop • Data Collection – ssrun main data collection tool. Running it on a.out creates the files a.out.experiment.mPID and a.out.experiment.pPID – ssusage summary of resources used, similar to the time commands – ssapi API for caliper points • Data Analysis – prof
  • 109. Ssrun sampling Experiments Statistical sampling, triggered by a preset time base or by overflow of hardware counters -pcsamp PC sampling gives user CPU time -usertime call stack sampling, gives user and system CPU time -totaltime call stack sampling, gives walltime
  • 110. Ssrun sampling Sampling triggered by overflow of R1X000 hardware counters -gi_hwc Graduated instructions -gfp_hwc Floating point instructions -ic_hwc Misses in L1 I-cache -dc_hwc Misses in L1 D-cache -dsc_hwc Data misses in L2 cache -tlb_hwc TLB misses -prof_hwc User selected event
  • 111. User selected sampling • Select a hardware counter, say secondary cache misses (26), and an overflow value setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 26 setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 99 • Run the experiment ssrun -prof_hwc a.out • Default counter is L1 I-cache misses (9) and default overflow is 2,053
  • 112. Ssrun ideal and tracing experiments • Ideal Experiment: basic block counting -ideal counts the number of times each basic block is executed and estimates the time. Descendant of pixie • Tracing -fpe floating point exceptions -io file open, read,write, close -heap malloc and free
  • 113. Prof Tool • Display event counts or time in routines sorted in descending order of the counts • Source line granularity with command line option -h or -l • For ideal and usertime experiments get call hierarchy with -butterfly option • For ideal experiment can get architecture information with the -archinfo option • Cut off report at top 100-p% with -quit p%
  • 114. Address Space Profiling: dprof • Gives per process histograms of page accesses • Sampling with a specified time base – the current instruction is interrupted – the address of the operand referenced by the interrupted instruction is recorded • Time base is either the interval timer or an R1X000 hardware counter overflow • R1X000 counters: man r10k_counters
  • 115. Data Profiling: dprof • Syntax dprof [-hwpc [-cntr n] [-ovfl m]] [-itimer [-ms t]] [-out profile_file] a.out • Default is interval timer ( -itimer ) with t=100 ms • Can select hardware counter (-hwpc) which has the defaults n = 0 is the R1X000 cycle counter m=10000 is the counter’s overflow value
  • 116. The Future of OpenMP • Data placement directives will become part of OpenMP – affinity scheduling may be a useful feature • It is desirable to add parallel input/output to OpenMP • Java binding of OpenMP
  • 117. Image class
class Image {
public:
  short* mData;
  int mWidth, mHeight, mDepth;
  int mVoxelsPerSlice;
  int mVoxelsPerVolume;
  short* mSlicePointers; // Pointers to the start of each slice
  short getVoxel ( int x, int y, int z ) {...}
  void setVoxel ( int x, int y, int z, short v ) {...}
};
  • 118. Threshold – OpenMP #1
void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel(x,y,z) > 100 ) { out->setVoxel(x,y,z,1); }
        else { out->setVoxel(x,y,z,0); }
      }
    }
  }
}
// NB: can loop over slices, rows or columns by moving
// the pragma, but must choose at compile time
  • 119. Threshold – OpenMP #2
void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) { out->mData[s] = 1; }
    else { out->mData[s] = 0; }
  }
}
// Likely a lot faster than the previous code
  • 120. References
• Lawrence Livermore National Laboratory: www.llnl.gov/computing/tutorials/workshops/workshop/openMP/MAIN.html
• Ohio Supercomputing Center: oscinfo.osc.edu/training/openmp/big
• Minnesota Supercomputing Institute: www.msi.umn.edu/tutorials/shared_tutorials/openMP
• Edinburgh OpenMP Microbenchmarks: www.epcc.ed.ac.uk/research/openmpbench
• Mattson and Eigenmann Tutorial: dynamo.ecn.purdue.edu/~eigenman/EE563/Handouts/BWsc00introOMP.pdf
• Mattson and Eigenmann Advanced OpenMP: dynamo.ecn.purdue.edu/~eigenman/EE563/Handouts/BWsc00advancedOMP.pdf