- OpenMP provides compiler directives and library calls to incrementally parallelize applications for shared memory multiprocessor systems. It works by allowing the master thread to spawn worker threads to perform work concurrently using directives like parallel and parallel do.
- Variables in OpenMP can be shared, private, or reduction. Shared variables are accessible by all threads while private variables have a separate copy for each thread. Reduction variables are used to combine values across threads.
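- Synchronization is needed to coordinate thread access and ensure correct results. The critical directive gives one thread at a time exclusive access to shared data, the barrier directive makes threads wait for one another, and a parallel loop ends with an implicit barrier.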
2. Introduction
• Enhanced performance for applications is the only
reason to go to parallel computers
• Parallel systems (with multiple, low cost, commodity
microprocessors) can have significant cost advantage
over even the best single processor computer
• Application developer has to design and program
correct and efficient parallel code for multiprocessors
• Result is the combined effect of application
performance and price performance
3. Introduction : Performance with OpenMP
• OpenMP provides the option for incremental
parallelization
• Developers can parallelize an application at their own
pace, adding effort only where it delivers performance
for the price
• MPI tends to require a larger up-front investment
• Incremental parallelization can be done in different
versions of a code allowing a conservative development
path
• Suitable for apps that have been developed and tested
for many years
4. Introduction : OpenMP code
• OpenMP codes are closer to sequential codes
– What needs to be done
– Input/output
– Algorithms designed to do the work
– In case of parallel program how the work is distributed among
processors
• OpenMP supports the final step
• OpenMP works with Fortran and C/C++ (C++ might be
limited)
• OpenMP is a set of compiler directives to describe
parallelism in the source code
• OpenMP also has library routines
5. Introduction : First OpenMP code
program hello
use omp_lib   ! declares omp_get_thread_num as an integer function
print *, "Hello from threads:"
!$OMP parallel
print *, omp_get_thread_num()
!$OMP end parallel
print *, "Back to sequential:"
end
6. Introduction : OpenMP code
• Set OMP_NUM_THREADS to 4
• Within OpenMP directives three additional copies of
the code are started (what does this mean?)
• Each copy is called a thread or thread of execution
• The OpenMP routine omp_get_thread_num() reports
unique thread# between 0 and
OMP_NUM_THREADS-1
7. Introduction : Output of code
Output after running on 4 threads :
Hello from threads:
1
0
3
2
Back to sequential:
Analysis of OpenMP output :
Threads work completely independently, so the order of the
thread numbers can differ from run to run
Threads may be forced to cooperate to produce correct
and efficient results - leads to synchronization
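The same experiment can be sketched in C. This is a minimal illustration rather than part of the original deck: hello_from_threads and my_thread_num are hypothetical helper names, and the _OPENMP guard is there so the file also builds with a compiler that ignores the pragmas, in which case only thread 0 runs.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: the calling thread's number, or 0 when the
   code is compiled without OpenMP support. */
static int my_thread_num(void) {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Runs one parallel region and marks which thread numbers took part.
   seen must have room for max_threads entries.
   Returns the number of distinct thread numbers recorded. */
int hello_from_threads(int *seen, int max_threads) {
    int count = 0;
    int t;

    for (t = 0; t < max_threads; t++) {
        seen[t] = 0;
    }

    printf("Hello from threads:\n");
    #pragma omp parallel
    {
        int tid = my_thread_num();  /* declared inside the region: private */
        if (tid < max_threads) {
            seen[tid] = 1;          /* each thread writes its own element */
        }
        printf("%d\n", tid);        /* print order is not deterministic */
    }
    printf("Back to sequential:\n");

    for (t = 0; t < max_threads; t++) {
        count += seen[t];
    }
    return count;
}
```

The thread numbers may print in a different order on each run, but thread 0 (the master) always takes part.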
8. Introduction : OpenMP Parallel Computers
• Primarily for shared memory parallel computers
• All processors are able to directly access all of the
memory
• UMA/SMPs (Compaq AlphaServers, all multiprocessor
PC and workstations, SUN Enterprise) and distributed
shared memory (DSM) or ccNUMA (SGI Origin 2000,
HP V-Class)
• SMPs are ~32 procs and ccNUMAs can scale from
100s to even 1000s of procs
• OpenMP codes usually result in a few % to at most 20%
increase in code size compared to sequential codes (MPI
codes can result in 50% to a few 100% increase in size)
9. OpenMP Partners
• OpenMP web site at http://www.openMP.org
The OpenMP Architecture Review Board (1997) is comprised of the following organizations.
• Compaq
• Hewlett-Packard Company
• Intel Corporation
• International Business Machines (IBM)
• Kuck & Associates, Inc. (KAI)
• Silicon Graphics, Inc.
• Sun Microsystems, Inc.
• U.S. Department of Energy ASCI program
The following software vendors also endorse the OpenMP API:
• Absoft Corporation
• Edinburgh Portable Compilers
• Etnus, Inc.
• GENIAS Software GmbH
• Myrias Computer Technologies, Inc.
• The Portland Group, Inc. (PGI)
10. Summary of OpenMP Introduction
• MPI has its advantages and disadvantages
• Pthreads is an accepted standard for shared memory in
the low end
– Not for HPC
– Little support for Fortran
– More suitable for task than data parallel
• OpenMP : incremental parallelization
• OpenMP : compiler directives and library calls
• Directives allow writing portable codes since they are
ignored by non-OpenMP compilers
12. OpenMP: Fortran
• OpenMP compiler directives (fixed form) :
12345 6 (columns)
!$omp <directive>
C$omp <directive>
*$omp <directive>
– The sentinel occupies columns 1-5; column 6 must contain a
space or zero
• Treated as an OpenMP directive by an OpenMP compiler
and treated as a comment by non-OpenMP compilers
• In fixed form the sentinel begins in column 1; in free
form !$omp may start in any column, preceded only by
white space
• Continuation in free form is expressed with a trailing & :
!$omp <directive> &
• In fixed form a continuation line has a character other than
a space or zero in column 6, e.g. !$omp&
13. OpenMP : C
• #pragma omp
• In general make sure conditional compilation is used
with care
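One way to read the advice on conditional compilation: a compiler with OpenMP enabled defines the _OPENMP macro, so calls into the runtime library can be guarded by it. The sketch below assumes nothing beyond that macro and omp_get_max_threads(); the function names openmp_enabled and available_threads are made up for illustration.

```c
#ifdef _OPENMP
#include <omp.h>   /* only available when OpenMP is enabled */
#endif

/* Returns 1 when the file was compiled with OpenMP enabled, 0 otherwise. */
int openmp_enabled(void) {
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}

/* Upper bound on the number of threads a parallel region may use.
   Falls back to 1 for a purely sequential build. */
int available_threads(void) {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Keeping the guarded code to small helpers like these limits how much of the program differs between the sequential and the parallel build.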
14. OpenMP: Programming Model
• Fork/Join parallelism
– Master thread spawns threads as needed
– Parallelism is added incrementally i.e. the sequential program
evolves into a parallel program
[Figure: fork/join diagram of the master thread and the parallel regions it spawns]
15. • Two basic kinds of constructs for controlling
parallelism :
• Directive to create multiple threads of execution that
execute concurrently
– Used to execute multiple structured blocks concurrently
– “parallel” directive
• Directive to divide work among existing set of parallel
threads
– Used to do loop level parallelism
– “do” in Fortran and “for” in C
16. • OpenMP begins with single thread of control that has
the execution context or data environment : master
thread
• The execution context is the data address space
containing all the variables in the program : all the
global, subroutine variables (allocated on the stack) and
dynamically allocated variables (in the heap)
• Master thread exists for the duration of the entire
program
• During parallel construct new threads of execution are
created
OpenMP Data Sharing
17. • Each thread has its own stack within its execution
context, hence multiple threads can individually invoke
subroutines and execute safely without interfering with the
stack frames of other threads
• For other program variables OpenMP parallel construct:
– Can share a single copy between all the threads
– Can provide each thread with its own copy
• Same variable can be shared within one parallel
construct and private in another
18. Shared Data
• A shared variable will have single storage location in
the memory for the entire duration of that parallel
construct
• All the threads will access the same memory location
• Read/Write operations will allow communication
between multiple OpenMP threads
19. Private Data
• A variable that has private scope will have multiple
storage locations
• Execution context of each thread will have a copy of the
variable for the duration of the parallel construct
• All read/write operations on that variable by a thread
will refer to the private copy
• This memory location is inaccessible to other threads
20. Reduction Data
• Reduction variables have both private and shared
behavior
• These variables are the target of an arithmetic operation
• Example is final summation of temporary local
variables at the end of a parallel construct
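The three data scopes can be seen side by side in one small C loop. This is a sketch under assumptions: the function name sum_of_squares and the variable tmp are illustrative and do not come from the deck.

```c
/* Sums the squares of x[0..n-1].
   x and n : shared, every thread reads the same storage
   tmp     : private, each thread gets its own scratch copy
   sum     : reduction, private partial sums combined at the end */
double sum_of_squares(const double *x, int n) {
    double sum = 0.0;
    double tmp = 0.0;
    int i;   /* the parallel loop index is private automatically */

    #pragma omp parallel for shared(x, n) private(tmp) reduction(+:sum)
    for (i = 0; i < n; i++) {
        tmp = x[i] * x[i];   /* written by every thread: must be private */
        sum += tmp;          /* accumulates into the thread's partial sum */
    }
    return sum;              /* partial sums have been added together */
}
```

Without private(tmp), all threads would write to the same tmp and could read each other's values. Without the reduction clause, the update of sum would be a race.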
21. Synchronization
• Multiple OpenMP threads communicate with each other
through ordinary reads and writes
• Coordination is necessary so that they don’t
simultaneously attempt to modify variables or read
when a variable is being modified
– This can lead to incorrect results without warning
– This can also produce different results in different runs
• Mutual exclusion : critical directive allows a thread
exclusive access to a shared variable for the duration of
the construct
• Event synchronization : barrier directive signals
occurrence of an event across multiple threads
• There are other synchronization constructs
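A short C sketch of both kinds of synchronization: critical protects the update of a shared counter, and barrier makes every thread wait until all updates are done before reading it. The function count_threads and its min_seen parameter are hypothetical names chosen for this illustration.

```c
#include <limits.h>

/* Every thread adds 1 to a shared counter, then reads it back after a
   barrier. Returns the final count; *min_seen receives the smallest
   value any thread observed after the barrier. */
int count_threads(int *min_seen) {
    int count = 0;          /* shared */
    int lowest = INT_MAX;   /* shared */

    #pragma omp parallel
    {
        int snapshot;       /* private: declared inside the region */

        /* Mutual exclusion: one thread at a time updates count. */
        #pragma omp critical
        count = count + 1;

        /* Event synchronization: wait until all threads have counted. */
        #pragma omp barrier
        snapshot = count;

        #pragma omp critical
        if (snapshot < lowest) lowest = snapshot;
    }

    *min_seen = lowest;
    return count;
}
```

Because of the barrier, every thread should observe the same final count, so *min_seen equals the return value. Removing the barrier would let early threads read a partial count.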
22. Simple Loop Parallelization
subroutine sub1(z, a, x, y, n)
integer i , n
real z(n), a, x(n), y
do i = 1, n
z(i) = a*x(i) + y
enddo
return
end
• No dependences in the above loop : result of one
iteration doesn’t depend on result of any other iteration
• 2 procs can simultaneously execute two iterations
• Use parallel do
23. Fortran Parallel Loop
subroutine sub1(z, a, x, y, n)
integer i , n
real z(n), a, x(n), y
!$omp parallel do
do i = 1, n
z(i) = a*x(i) + y
enddo
return
end
• Directive followed by do loop construct says to execute the iterations
concurrently across multiple threads
• An OpenMP compiler creates multiple threads and distributes the
iterations of the loop across threads for parallel execution
24. C Parallel Loop
void routine (float z[], float a, float x[], float y, int n) {
int i ;
/* iterations are divided among the threads; i is private to each thread */
#pragma omp parallel for
for ( i = 0 ; i < n ; i++ ) {
z[i] = a*x[i] + y ;
}
}
25. Execution Model
• OpenMP doesn’t specify:
– How threads are implemented
– How distinct sets of iterations are assigned to
threads
Master thread executes serial portion
Master thread enters the subroutine
Master thread sees parallel do directive
Master and worker threads concurrently do iterations
Implicit barrier : wait for all threads to finish iterations
Master thread continues; workers disappear
27. Data Sharing
[Figure: z, a, x, y, n and i live in global shared memory. During parallel
execution (multiple threads) each thread references the shared z, a, x, y
and n, and has its own private copy of i. The private i is initially
undefined, and i is also undefined after the parallel construct.]
28. Synchronization
• Synchronization for Z
– Multiple threads modify shared variable Z
– Each thread modifies distinct element of Z
– No data conflict; no explicit synchronization
• Master thread needs to see all updated values of Z after
the parallel loop
– Only master thread executes after parallel do/for
– Parallel do/for has implied barrier at the end for all threads
including master thread – guarantees that all iterations have
completed and all Z values updated
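The guarantee can be sketched in C: the serial loop after the parallel loop reads every element of z without any explicit synchronization, relying only on the implied barrier. The function name axpy_then_sum is an assumption made for this example.

```c
/* Computes z[i] = a*x[i] + y in parallel, then sums z serially. */
float axpy_then_sum(float *z, float a, const float *x, float y, int n) {
    float sum = 0.0f;
    int i;

    /* Each iteration writes a distinct element of the shared array z,
       so the threads never conflict. */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        z[i] = a * x[i] + y;
    }
    /* Implied barrier: all iterations are complete at this point. */

    /* Only the master thread runs this loop, and it sees every update. */
    for (i = 0; i < n; i++) {
        sum += z[i];
    }
    return sum;
}
```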
29. Simple Loop Parallelization
• Easy to express
• Can be used to parallelize large codes by incrementally
parallelizing individual loops
• Problems :
– Some apps may not have many loops
– Overhead of joining threads at the end of each loop – this is a
synchronization point and need to wait for the slowest one
30. More Loop Parallelization
real*8 x, y, x1, y1
integer i, j, m, n
integer distance(*,*)
integer function
……..
x1 = 10.
y1 = 20.
do i = 1, m
do j = 1, n
x = i / real(m)
y = j / real(n)
distance(i, j) = function(x, y, x1, y1)
enddo
enddo
31. • function takes x, y, x1 and y1 and calculates the distance
between two points (x,y) and (x1, y1)
• Are different iterations independent of each other ?
• Look at source code of function to see if thread safe
• Scalar variables j, x and y are assigned and function called in the
innermost loop
• parallel do will distribute iterations of the outermost loop among
threads
• By default i is private and everything else is shared
• m and n are only read – ok to have them shared
• Loop index j needs to be private – why ?
• x and y need to be private also – why ?
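• distance is modified inside the loop, but each iteration writes a
distinct element, so no explicit synchronization is needed
• The implicit barrier of parallel do ensures all elements are written
before the master thread continues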
32. Parallel code
………
x1 = 10.
y1 = 20.
!$omp parallel do private(j, x, y)
do i = 1, m
do j = 1, n
x = i / real(m)
y = j / real(n)
distance(i, j) = function(x, y, x1, y1)
enddo
enddo
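A C rendering of the same loop nest, as a sketch. The deck does not show the body of its distance function, so squared_distance below is a hypothetical stand-in, and fill_distances and the row-major layout of dist are likewise assumptions. In C the inner index j and the scalars x and y are shared by default when declared outside the loop, so each thread's inner loop would overwrite the others' values unless they are listed as private.

```c
/* Hypothetical stand-in for the deck's unspecified distance function:
   squared Euclidean distance between (x, y) and (x1, y1). It touches
   only its arguments, so calling it from several threads is safe. */
static double squared_distance(double x, double y, double x1, double y1) {
    double dx = x - x1;
    double dy = y - y1;
    return dx * dx + dy * dy;
}

/* Fills the m-by-n row-major array dist (m*n doubles). */
void fill_distances(double *dist, int m, int n) {
    const double x1 = 10.0, y1 = 20.0;   /* only read inside the loop */
    int i, j;
    double x, y;

    /* i is private automatically as the parallel loop index;
       j, x and y must be made private explicitly. */
    #pragma omp parallel for private(j, x, y)
    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            x = i / (double)m;
            y = j / (double)n;
            dist[(i - 1) * n + (j - 1)] = squared_distance(x, y, x1, y1);
        }
    }
}
```

Declaring j, x and y inside the outer loop body would make them private without a clause, which is often the simpler choice in C.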
33. Synchronization
x1 = 10.
y1 = 20.
total_dist = 0.
do i = 1, m
do j = 1, n
x = i / real(m)
y = j / real(n)
distance(i, j) = function(x, y, x1, y1)
total_dist = total_dist + distance(i,j)
enddo
enddo
How does parallelization change ?
34. • Simply declaring total_dist shared is not enough
• total_dist needs to be shared, but it is modified in the
parallel portion of the code
• Multiple threads write to total_dist – no guaranteed
ordering among the threads' reads and writes – many
possibilities of a wrong result
• This is a race condition on access to a shared variable
• Access needs to be controlled through synchronization
• A critical/end critical construct can be executed by only one
thread at a time
• First thread to reach critical executes the code – all others
wait until the current thread is done
• One thread at a time updates the value of total_dist
35.
x1 = 10.
y1 = 20.
total_dist = 0.
!$omp parallel do private(j, x, y)
do i = 1, m
do j = 1, n
x = i / real(m)
y = j / real(n)
distance(i, j) = function(x, y, x1, y1)
!$omp critical
total_dist = total_dist + distance(i,j)
!$omp end critical
enddo
enddo
( Explicit synchronization - inserted by programmer)
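The same pattern in C, reduced to a single loop over an integer array so that the result is exact. This is an illustrative sketch; total_with_critical is a made-up name.

```c
/* Adds dist[0..count-1] into one shared total.
   The update of total is a read-modify-write on a shared variable,
   so it is wrapped in a critical section. */
long total_with_critical(const int *dist, int count) {
    long total = 0;   /* shared */
    int k;            /* parallel loop index: private automatically */

    #pragma omp parallel for
    for (k = 0; k < count; k++) {
        /* Only one thread at a time executes this statement. */
        #pragma omp critical
        total = total + dist[k];
    }
    return total;
}
```

The answer is correct, but every update is serialized, so a loop whose body is mostly the critical section gains little from running in parallel.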
36. Reduction clause
• critical section protects shared variable total_dist
• Basic operation was sum reduction
• Reductions are common enough that OpenMP has a reduction data
scope clause
total_dist = 0.
!$omp parallel do private(j, x, y)
!$omp& reduction (+:total_dist)
do i = 1, m
do j = 1, n
x = i / real(m)
y = j / real(n)
distance(i, j) = function(x, y, x1, y1)
total_dist = total_dist + distance(i,j)
enddo
enddo
37. • Reduction clause tells compiler that total_dist is the
target of a sum reduction
• Many other mathematical operations are possible in a
reduction (e.g. *, max, min)
• Compiler and runtime environment will implement the
reduction in an efficient manner for the target machine
• Reduction is a data attribute distinct from either shared
or private and has elements of both shared and private
data
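In C the reduction clause has the same form. The sketch below shows a sum and a product; the function names are illustrative, and integer data is used so that the result does not depend on the order in which partial results are combined.

```c
/* Sum reduction: each thread accumulates a private partial sum that
   starts at 0; the partial sums are added into total at the end. */
long total_with_reduction(const int *dist, int count) {
    long total = 0;
    int k;

    #pragma omp parallel for reduction(+:total)
    for (k = 0; k < count; k++) {
        total += dist[k];
    }
    return total;
}

/* Product reduction: private copies start at 1, the identity for *. */
long product_with_reduction(const int *v, int count) {
    long product = 1;
    int k;

    #pragma omp parallel for reduction(*:product)
    for (k = 0; k < count; k++) {
        product *= v[k];
    }
    return product;
}
```

Compared with a critical section around the update, no thread waits on another inside the loop; the only combining step happens once, at the end of the construct.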