The document surveys OpenMP constructs for parallelizing work across multiple threads. A parallel for loop distributes loop iterations across the threads of a team so they execute concurrently. A parallel sections construct assigns different, independent tasks to different threads. The nowait clause on a for or sections construct removes the implicit barrier at the end of that worksharing construct.
This presentation deals with how one can utilize multiple cores in C/C++ applications using an API called OpenMP. It is a shared-memory programming model, typically built on top of POSIX threads. The fork-join model and parallel design patterns are also discussed using Petri nets.
An introduction to the OpenMP parallel programming model.
From the Scalable Computing Support Center at Duke University (http://wiki.duke.edu/display/scsc)
1. Work Replication with Parallel Region
#pragma omp parallel
{
  for (j = 0; j < 10; j++)
    printf("Hello\n");
}
On 5 threads we get 50 printouts of "Hello", since each thread executes all 10 iterations concurrently with the other threads.
#pragma omp parallel for
for (j = 0; j < 10; j++)
  printf("Hello\n");
Regardless of the number of threads we get 10 printouts of "Hello", since the loop iterations are divided among the team of threads. (Note that parallel for applies directly to the following loop; no extra braces are needed.)
2. NOWAIT clause : C
#pragma omp parallel
{
  #pragma omp for nowait
  for (j = 1; j < n; j++)
    b[j] = (a[j] + a[j-1]) / 2.0;

  #pragma omp for
  for (j = 1; j < n; j++)
    c[j] = d[j] / e[j];
}
The nowait clause removes the implied barrier at the end of the first loop, so threads can start the second loop without waiting; that is safe here because the two loops operate on independent arrays.
3. Parallel Sections
• So far we have divided the work of one task among threads
• Parallel sections allow us to assign different tasks to different threads
  – Need to make sure that none of the later tasks depends on the results of the earlier ones
• This is helpful where it is difficult or impossible to speed up individual tasks by executing them in parallel
• The code for the entire sequence of tasks or sections begins with a sections directive and ends with an end sections directive
• The beginning of each section is marked by a section directive, which is optional for the very first section
4. Fortran sections directive
!$omp parallel sections [clause...]
[!$omp section]
   code for 1st section
!$omp section
   code for 2nd section
!$omp section
   code for 3rd section
   ...
!$omp end parallel sections
6. • clause can be private, firstprivate, lastprivate, reduction
• In Fortran the NOWAIT clause goes at the end:
!$omp end sections [nowait]
• In C/C++ NOWAIT is provided with the omp sections
pragma: #pragma omp sections nowait
• Each section is executed once and each thread executes
zero or more sections
• A thread may execute more than one section if there are
more sections than threads
• It is not possible to determine if one section will be
executed before another or if two sections will be
executed by the same thread
7. Assigning work to a single thread
• Within a parallel region a block of code may be executed just once by any one of the threads in the team
  – There is an implicit barrier at the end of single (unless the nowait clause is supplied)
  – Clause can be private or firstprivate
Fortran :
!$omp single [clause...]
   block of code to be executed by just one thread
!$omp end single [nowait]
C/C++ :
#pragma omp single [clause, ... nowait]
   block of code to be executed by just one thread
8. single for I/O
• Common use of single is for reading in shared input
variables or writing output within a parallel region
• I/O may not be easy to parallelize
9. omp_get_thread_num, omp_get_num_threads
• Remember OpenMP uses the fork/join model of parallelization
• Thread teams are only created within a parallel construct (parallel do/for, parallel)
• omp_get_thread_num and omp_get_num_threads only report team information from inside a parallel construct where threads have been forked; in serial code omp_get_num_threads() returns 1 and omp_get_thread_num() returns 0
10. Synchronization
• Critical - for any block of code
• Barrier – where all threads join
• Other synchronization directives :
– master
– ordered
11. Synchronization : master clause
• The master directive identifies a structured block of code
that is executed by the master thread of the team
• No implicit barrier at the end of master directive
• Fortran :
!$omp master
code block
!$omp end master
• C/C++ :
#pragma omp master
code block
12. master example
!$omp parallel                (C/C++: #pragma omp parallel)
  !$omp do                    (C/C++: #pragma omp for)
  loop i = 1 : n
     calculation
  end loop
  !$omp master                (C/C++: #pragma omp master)
     print result (reduction) from the above loop
  !$omp end master
  more computation
!$omp end parallel
13. Synchronization : ordered clause
• The structured block following an ordered directive is
executed in the order in which iterations would be
executed in a sequential loop
• Fortran :
!$omp ordered
code block
!$omp end ordered
• C/C++:
#pragma omp ordered
code block
14. ordered example
parallel loop (parallel do/for) with the ordered clause
loop i = 1 : n
   a[i] = ...calculation...
   !$omp ordered              (C/C++: #pragma omp ordered)
      print a[i]
   !$omp end ordered
end parallel loop
15. OpenMP Performance
• Each processor has its own cache in shared memory
machine
• Data locality in caches and loop scheduling
• False sharing
16. Data locality in caches and loop scheduling
• loop j = 0 : n
loop k = 0 : n
a[j][k] = k +1 + a[j][k]
• loop j = 0 : n
loop k = 0 : n
a[j][k] = 1./a[j][k]
• Assume each processor's cache can hold its local part of the matrix
• After the first loop each processor's cache will hold some of the data (cache-line dependent). In the next loop it may or may not get to operate on that same data, depending on scheduling
• Static scheduling may provide better cache performance than dynamic scheduling, since each thread is assigned the same iterations in both loops
17. False sharing
• If different processors update stride-one (adjacent) elements of an array, this can cause poor cache performance
• The cache line has to be invalidated repeatedly among all the processors
• Parallel loop with schedule(static,1):
  loop j = 1 : n
     a[j] = a[j] + j
• Proc 1 updates a[1], proc 2 updates a[2], etc.
• The cache line holding these neighbouring elements must be invalidated for each processor's update; this leads to bad performance
18. Look up from OpenMP standard
• Threadprivate
!$omp threadprivate (/cb1/, /cb2/)
#pragma omp threadprivate(list)
• cb1, cb2 are named common blocks in Fortran; list is a list of named file-scope or namespace-scope variables in C/C++
• Threadprivate makes named common blocks private to a thread but global within the thread
• Threadprivate makes the named file-scope or namespace-scope variables (list) private to a thread but visible at file scope within the thread
19. Look up from OpenMP standard
• The atomic directive ensures that a specific memory location is updated atomically; it can be cheaper than critical because it can map directly to hardware atomic instructions
• C:
#pragma omp parallel for
for (i = 0; i < n; i++) {
  #pragma omp atomic
  a[index[i]] = a[index[i]] + 1;
}
• Fortran:
!$omp parallel do
do i = 1, n
  !$omp atomic
  y(index(i)) = y(index(i)) + c
end do