A Guide to OpenMP

Presented By: Dr. Barbara Chapman
Created By: Brett Estrade

HPCTools Group
University of Houston
Department of Computer Science

http://www.cs.uh.edu/~hpctools
An Example Shared Memory System & Cache Hierarchy
[Diagram: four CPUs (cpu 0 through cpu 3). Each CPU has its own L1
cache; each pair of CPUs shares an L2 cache, and both L2 caches
connect to main memory.]
Multiple Processes on this System
[Diagram: four processes (p0 through p3) resident in main memory, each
running on its own CPU: process 0 on cpu 0 through process 3 on cpu 3.]
Multiple Processes on this System – this is MPI's model
[Diagram: the same four processes (p0 through p3), now exchanging data
via explicit send/recv message passing between the CPUs. Each process
keeps its own private memory; this is MPI's model.]
A Single Process with Multiple Threads on this System
[Diagram: a single process (process 0, p0) in main memory with four
threads, Thread 0 through Thread 3, each running on its own CPU and
all sharing the process's memory.]
OpenMP's Execution Model – Fork & Join
[Diagram: fork-join execution. Thread 0 of process 0 runs serially on
cpu 0; at the fork (F), threads t0 through t3 each "do stuff" on
cpu 0 through cpu 3; at the join (J), execution continues serially on
thread 0. All threads share main memory.]

Here, thread 0 is on CPU 0; at the fork, 3 new threads are
created and are distributed to the remaining 3 CPUs.
An Example of Sharing Memory
[Diagram: process 0 with four threads on four CPUs; a shared variable,
int a, resides in the process's memory (p0) and is visible to every
thread.]
An Example of Sharing Memory
[Diagram] Thread 0 writes integer a.
An Example of Sharing Memory
[Diagram] Thread 1 reads integer a.
An Example of Sharing Memory
[Diagram] Thread 2 reads integer a.
An Example of Sharing Memory
[Diagram] Thread 3 reads integer a.
Notes on the Example of Sharing Memory
●   Threads execute their code independently
●   So, usually, Read/Write order matters
●   If the order is not specified in some way, this could represent
    a (data) race condition
●   Race conditions introduce non-determinism (not good); see the
    sketch below
●   Threaded programs can be extremely difficult to debug
●   Proper precautions must be taken to eliminate data races
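A minimal sketch (not from the slides) of a data race in C: every
thread performs an unsynchronized read-modify-write of a shared
counter, so updates can be lost and the final value varies from run
to run.

 #include <stdio.h>
 #include <omp.h>

 int main(void) {
   int counter = 0;            /* shared by all threads */
 #pragma omp parallel
   {
     /* Data race: an unsynchronized read-modify-write of a
        shared variable; updates from other threads can be lost. */
     counter = counter + 1;
   }
   /* Non-deterministic: may print any value from 1 up to the
      number of threads. */
   printf("counter = %d\n", counter);
   return 0;
 }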
What is OpenMP?
   ●   An industry standard for shared memory parallel programming
            –   OpenMP Architecture Review Board
            –   AMD, Intel, IBM, HP, Microsoft, Sun/Oracle, Fujitsu, NEC,
                  Texas Instruments, PGI, CAPS; LLNL, ORNL, ANL, NASA,
                  cOMPunity,..
            –   http://www.openmp.org/
   ●   A set of directives for describing parallelism in an application code
   ●   A user-level API and runtime environment
   ●   A widely supported standard set of parallel programming pragmas
       with bindings for Fortran, C, & C++
   ●   A community of active users & researchers
The Anatomy of an OpenMP Program
 #include <stdio.h>
 #include <omp.h>

 int main (int argc, char *argv[]) {
   int tid, numt;
 #pragma omp parallel private(tid) shared(numt)   /* parallel (fork) directive + clauses */
   {                                              /* structured parallel block */
      tid = omp_get_thread_num();                 /* runtime function */
      printf("hi, from %d\n", tid);
 #pragma omp barrier                              /* directive (thread barrier) */
      if ( tid == 0 ) {
        numt = omp_get_num_threads();             /* runtime function */
        printf("%d threads say hi!\n", numt);
      }
   }
   return 0;
 }
OpenMP is Really a Standard Specification

[Figure: the timeline of the OpenMP Standard Specification.]
OpenMP vs MPI
There is no silver bullet

And that makes Teen Wolf happy
Some Benefits of using OpenMP

●   It's portable, supported by most C/C++ & Fortran compilers
●   Much of the serial code can be left untouched (most of the time)
●   The development cycle is a friendly one
        –   Can be introduced iteratively into existing code
        –   Correctness can be verified along the way
        –   Likewise, performance benefits can be gauged
●   Optimizing memory access in the serial program will benefit
    the threaded version (can help avoid false sharing, etc.)
●   It can be fun to use (immediate gratification)
What Does OpenMP Provide?
●   A more abstract programming interface than low-level
    thread libraries
●   The user specifies the strategy for parallel execution, but not
    all the details
●   Directives, written as structured comments
●   A runtime library that manages execution dynamically
●   Control via environment variables & the runtime API
●   Expectations of behavior & sensible defaults
●   A promise of interface portability
What Compilers Support OpenMP?
  Vendor           Languages                   Supported Specification
  IBM              C/C++(10.1),Fortran(13.1)   Full 3.0 support
  Sun/Oracle       C/C++,Fortran(12.1)         Full 3.0 support
  Intel            C/C++,Fortran(11.0)         Full 3.0 support
  Portland Group   C/C++,Fortran               Full 3.0 support
  Absoft           Fortran(11.0)               Full 2.5 support
  Lahey/Fujitsu    C/C++,Fortran(6.2)          Full 2.0 support
  PathScale        C/C++,Fortran               Full 2.5 support (based on Open64)
  HP               C/C++,Fortran               Full 2.5 support
  Cray             C/C++,Fortran               Full 3.0 on Cray XT Series Linux
  GNU              C/C++,Fortran               Working towards full 3.0
  Microsoft        C/C++,Fortran               Full 2.0
OpenMP Research Activities
    ●   A lot of research has gone into the OpenMP standard
    ●   International Workshop on OpenMP (IWOMP)
    ●   Suites: validation, NAS, SPEC, EPCC, BOTS
    ●   Open Source Research Compilers:
            –   OpenUH
            –   NANOS
            –   Rose/{OMNI,GCC}
            –   MPC, etc.
    ●   Commercial R&D
    ●   cOMPunity - http://www.compunity.org
    ●   Applications work with new features
Compiling and Executing Examples
●    IBM XL Suite: xlc_r, xlf90, etc.

       % xlc_r -qsmp=omp test.c -o test.x   # compile it
       % OMP_NUM_THREADS=4 ./test.x         # execute it

●    OpenUH: uhcc, uhf90, etc.

       % uhcc -mp test.c -o test.x          # compile it
       % OMP_NUM_THREADS=4 ./test.x         # execute it
OpenMP Directives & Constructs
●   Contained inside structured comments
    C/C++:
           #pragma omp <directive> <clauses>
    Fortran:
           !$OMP <directive> <clauses>
●   OpenMP-compliant compilers find and parse directives
●   Non-compliant compilers should safely ignore them as comments
●   A construct is a directive that affects the enclosing code
●   Imperative (standalone) directives exist
●   Clauses control the behavior of directives
●   Order of clauses has no bearing on effect
Summary of OpenMP's Directives and Key Clauses
●   Forking Threads: parallel
●   Distributing Work: for (C/C++), DO (Fortran), sections/section,
    WORKSHARE (Fortran)
●   Singling Out Threads: single, master
●   Mutual Exclusion: critical, atomic
●   Synchronization: barrier, flush, ordered, taskwait
●   Asynchronous Tasking: task
●   Data Environment: shared, private, threadprivate, reduction
The OpenMP “Runtime” Library (RTL)
●   The “runtime” manages the multi-threaded execution:
         –   It's used by the resulting executable OpenMP program
         –   It's what spawns threads (e.g., calls pthreads)
         –   It's what manages shared & private memory
         –   It's what distributes (shares) work among threads
         –   It's what synchronizes threads & tasks
         –   It's what reduces variables and keeps lastprivate values
         –   It's what is influenced by envars & the user-level API
●   Doxygen docs of OpenUH's OpenMP RTL, libopenmp, including the
    call graph for __ompc_fork in libopenmp/omp_threads.c:

       http://www2.cs.uh.edu/~estrabd/OpenUH/r593/html-libopenmp/
Call Graph of __ompc_fork in libopenmp/threads.c

[Figure: the Doxygen-generated call graph.]
Useful OpenMP Environment Variables
    ●   OMP_NUM_THREADS
    ●   OMP_SCHEDULE
    ●   OMP_DYNAMIC
    ●   OMP_STACKSIZE
    ●   OMP_NESTED
    ●   OMP_THREAD_LIMIT
    ●   OMP_MAX_ACTIVE_LEVELS
    ●   OMP_WAIT_POLICY
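A hypothetical usage sketch (test.x stands in for any OpenMP
executable): the variables are set in the launching shell, as in the
compile-and-run examples earlier.

       % OMP_NUM_THREADS=8 OMP_SCHEDULE="dynamic,4" ./test.x
       % OMP_NESTED=true OMP_MAX_ACTIVE_LEVELS=2 OMP_STACKSIZE=16M ./test.x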
3 Types of Runtime Functions
●   Execution environment routines; e.g.,
         –   omp_{set,get}_num_threads
         –   omp_{set,get}_dynamic
         –   Each envar has a corresponding get/set
●   Locking routines (generalized mutual exclusion); e.g.,
         –   omp_{init,set,test,unset,destroy}_lock
         –   omp_{...}_nest_lock
●   Timing routines; e.g.,
         –   omp_get_wtime
         –   omp_get_wtick
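A minimal sketch (not from the slides) that touches all three kinds
of routines: an execution environment routine, a lock, and the
timers.

 #include <stdio.h>
 #include <omp.h>

 int main(void) {
   omp_lock_t lock;
   omp_init_lock(&lock);            /* locking routine */
   omp_set_num_threads(4);          /* execution environment routine */

   double t0 = omp_get_wtime();     /* timing routine */
 #pragma omp parallel
   {
     omp_set_lock(&lock);           /* one thread at a time */
     printf("thread %d has the lock\n", omp_get_thread_num());
     omp_unset_lock(&lock);
   }
   printf("elapsed: %f s (clock tick = %f s)\n",
          omp_get_wtime() - t0, omp_get_wtick());

   omp_destroy_lock(&lock);
   return 0;
 }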
How Is OpenMP Implemented?
   ●   A compiler translates OpenMP
           –   It processes the directives and uses them to create
                  explicitly multithreaded code
           –   It translates the rest of the Fortran/C/C++ more or
                  less as usual
           –   Each compiler has its own strategy
   ●   The generated code makes calls to the runtime library
   ●   We have already seen what the runtime is typically
       responsible for
           –   The runtime library (RTL) also implements OpenMP's
                 user-level runtime routines
           –   Each compiler has its own custom RTL
How Is an OpenMP Program Compiled? Here's How OpenUH does it.
[Flowchart: the OpenUH compilation pipeline]

   1. Parse into Very High WHIRL
   2. High-level, serial program optimizations
   3. Put OpenMP constructs into standard format
   4. Loop-level serial program optimizations
   5. Transform OpenMP into threaded code
   6. Optimize the now (potentially) threaded code
   7. Output native instructions, or interface with a native
      compiler via source-to-source translation

 Liao et al.: http://www2.cs.uh.edu/~copper/openuh.pdf
What Does the Transformed Code Look like?
●   Intermediate code, “W2C” - WHIRL to C
         –   uhcc -mp -gnu3 -CLIST:emit_nested_pu simple.c
         –   http://www2.cs.uh.edu/~estrabd/OpenMP/simple/

The original main():

 #include <stdio.h>
 #include <omp.h>
 int main(int argc, char *argv[]) {
   int my_id;
 #pragma omp parallel default(none) private(my_id)
   {
     my_id = omp_get_thread_num();
     printf("hello from %d\n", my_id);
   }
   return 0;
 }

main is outlined to __omprg_main_1():

 static void __omprg_main_1(__ompv_gtid_a, __ompv_slink_a)
   _INT32 __ompv_gtid_a;
   _UINT64 __ompv_slink_a;
 {
   register _INT32 _w2c___comma;
   _UINT64 _temp___slink_sym0;
   _INT32 __ompv_temp_gtid;
   _INT32 __mplocal_my_id;

   /*Begin_of_nested_PU(s)*/

   _temp___slink_sym0 = __ompv_slink_a;
   __ompv_temp_gtid = __ompv_gtid_a;
   _w2c___comma = omp_get_thread_num();
   __mplocal_my_id = _w2c___comma;
   printf("hello from %d\n", __mplocal_my_id);
   return;
 } /* __omprg_main_1 */
The new main()
 extern _INT32 main() {
   register _INT32 _w2c___ompv_ok_to_fork;
   register _UINT64 _w2c_reg3;        /* __omprg_main_1's frame pointer */
   register _INT32 _w2c___comma;
   _INT32 my_id;
   _INT32 __ompv_gtid_s1;

   /*Begin_of_nested_PU(s)*/
   _w2c___ompv_ok_to_fork = 1;
   if(_w2c___ompv_ok_to_fork)
   {
     _w2c___ompv_ok_to_fork = __ompc_can_fork();
   }
   if(_w2c___ompv_ok_to_fork)
   {
     /* calls the RTL fork, passing a pointer to the outlined main() */
     __ompc_fork(0, &__omprg_main_1, _w2c_reg3);
   }
   else
   {
     /* serial version */
     __ompv_gtid_s1 = __ompc_get_local_thread_num();
     __ompc_serialized_parallel();
     _w2c___comma = omp_get_thread_num();
     my_id = _w2c___comma;
     printf("hello from %d\n", my_id);
     __ompc_end_serialized_parallel();
   }
   return 0;
 } /* main */

Nobody wants to code like this, so we let the compiler and
runtime do most of this tedious work!
Programming with OpenMP 3.0
The parallel Construct
●   Where the “fork” occurs (e.g., __ompc_fork(...))
●   Encloses all other OpenMP constructs & directives
●   This construct accepts the following clauses: if,
    num_threads, private, firstprivate, shared,
    default, copyin, reduction
●   Can call functions that contain “orphaned” constructs (see the
    sketch below)
        –   Statically (lexically) outside the parallel region, but
              dynamically inside it at runtime
    ●   Can be nested
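A minimal sketch (not from the slides) of an orphaned construct: the
for directive inside do_work() is lexically outside any parallel
region, but binds to the enclosing region at runtime when called
from main().

 #include <stdio.h>
 #include <omp.h>

 void do_work(void) {
   int i;
 #pragma omp for                    /* "orphaned" worksharing directive */
   for (i = 0; i < 8; i++)
     printf("thread %d does iteration %d\n",
            omp_get_thread_num(), i);
 }

 int main(void) {
 #pragma omp parallel               /* the region the orphan binds to */
   do_work();
   return 0;
 }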
A Simple OpenMP Example
C/C++

 #include <stdio.h>
 #include <omp.h>

 int main (int argc, char *argv[]) {
   int tid, numt;
 #pragma omp parallel private(tid) shared(numt)    /* fork */
   {
      tid = omp_get_thread_num();                  /* get thread id */
      printf("hi, from %d\n", tid);
 #pragma omp barrier                               /* wait for all threads */
      if ( tid == 0 ) {
        numt = omp_get_num_threads();              /* get number of threads */
        printf("%d threads say hi!\n", numt);
      }
   }                                /* join (implicit barrier, all wait) */
   return 0;
 }

Output using 4 threads:

 hi, from 3
 hi, from 0
 hi, from 2
 hi, from 1
 4 threads say hi!

Note, thread order not guaranteed!
The Fortran Version
F90

 program hello90
   use omp_lib
   integer :: tid, numt
 !$omp parallel private(tid) shared(numt)
   tid = omp_get_thread_num()
   write (*,*) 'hi, from', tid
 !$omp barrier
   if ( tid == 0 ) then
     numt = omp_get_num_threads()
     write (*,*) numt, 'threads say hi!'
   end if
 !$omp end parallel
 end program

(Same behavior and output as the C/C++ version above.)
Now, Just the Parallelized Code
(The same C and F90 programs as above; in the original slide, the
parallel region, from the parallel directive through the end of the
structured block, is highlighted.)
Trace of The Execution
[Diagram: execution trace. At the fork (F), threads 0 through 3 each
print “hi, from N” (all threads call printf), then wait at the thread
barrier (B). Only the thread with tid == 0 passes the test 0 == 0 and
prints “4 threads say hi!”; the other threads wait. All threads then
meet at the join (J).

B = wait for all threads @ barrier before progressing further.]
Controlling the Fork at Runtime
●   The “if” clause contains a conditional expression.
●   If TRUE, forking occurs; else it doesn't.

             int n = some_func();
             #pragma omp parallel if(n>5)
               {
                 … do stuff in parallel
               }

●   The “num_threads” clause is another way to control the
    number of threads active in a parallel construct.

             int n = some_func();
             #pragma omp parallel num_threads(n)
               {
                 … do stuff in parallel
               }
The Data Environment Among Threads
●   default([shared]|none|private)
●   shared(list) - supported by the parallel construct only
●   private(list)
●   firstprivate(list)
●   lastprivate(list) - supported by the loop & sections constructs only
●   reduction(<op>:list)
●   copyprivate(list) - supported by the single construct only
●   threadprivate - a standalone directive, not a clause

         #pragma omp threadprivate(list)
         !$omp threadprivate(list)

●   copyin(list) - supported by the parallel construct only (see the
    sketch below)
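A minimal sketch (not from the slides) of threadprivate with copyin:
each thread keeps its own copy of a global variable, seeded from the
master's copy at the fork.

 #include <stdio.h>
 #include <omp.h>

 int counter = 0;                     /* global, so eligible for threadprivate */
 #pragma omp threadprivate(counter)

 int main(void) {
   counter = 100;                     /* the master thread's copy */
 #pragma omp parallel copyin(counter) /* every thread's copy starts at 100 */
   {
     counter += omp_get_thread_num();
     printf("thread %d: counter = %d\n",
            omp_get_thread_num(), counter);
   }
   return 0;
 }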
Private Variable Initialization
●   private(list)
         –   Initial value of variable(s) is undefined
●   firstprivate(list) (see the sketch below)
         –   private variables are initialized with the value of the
               corresponding variable from the master thread at the
               time of the fork
●   copyin(list)
         –   Copy the value of the master's threadprivate variables
               to the corresponding variables of the other threads
●   threadprivate(list)
         –   Global-lifetime objects are replicated so each thread
               has its own copy
         –   static variables in C/C++, COMMON blocks in Fortran
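A minimal sketch (not from the slides) contrasting private and
firstprivate initialization:

 #include <stdio.h>
 #include <omp.h>

 int main(void) {
   int a = 42, b = 42;
 #pragma omp parallel private(a) firstprivate(b) num_threads(2)
   {
     a = omp_get_thread_num();  /* a started undefined; assign before use */
     b += a;                    /* b started at 42, copied from the master */
     printf("thread %d: a = %d, b = %d\n",
            omp_get_thread_num(), a, b);
   }
   /* The original a and b are untouched by the region. */
   printf("after the region: a = %d, b = %d\n", a, b);
   return 0;
 }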
Getting Data Out
●   copyprivate(list)
       –   Used by the single thread to pass the list values to the
             corresponding private vars in the other, non-executing,
             threads
●   lastprivate(list) (see the sketch below)
       –   vars in list are given the last value assigned to them by
             a thread
       –   supported by the loop & sections constructs
●   reduction(<op>:list)
       –   aggregates vars in list using the defined operation
       –   supported by the parallel, loop, & sections constructs
       –   <op> must be an actual operator or an intrinsic function
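A minimal sketch (not from the slides) of lastprivate: the value
written in the sequentially last iteration survives the loop.

 #include <stdio.h>
 #include <omp.h>

 int main(void) {
   int i, last = -1;
 #pragma omp parallel for lastprivate(last)
   for (i = 0; i < 100; i++)
     last = i * i;              /* each thread writes its private copy */
   /* last holds the value from iteration i == 99. */
   printf("last = %d\n", last); /* prints 9801 */
   return 0;
 }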
private & shared in that Simple OpenMP Example
(The same program as above; note the private(tid) and shared(numt)
clauses on the parallel directive: each thread gets its own tid,
while all threads share the single numt.)
Memory Consistency Model
●   OpenMP uses a “relaxed consistency” model
●   In contrast especially to sequential consistency
●   Cores may have out-of-date values in their caches
●   Most constructs imply a “flush” of each thread's cached values
●   Treated as a memory “fence” by compilers when it
    comes to reordering operations
●   OpenMP provides an explicit flush directive

         #pragma omp flush(list)
         !$OMP FLUSH(list)
Synchronization
●   Explicit sync points are enabled with a barrier:
            #pragma omp barrier
            !$omp barrier
●   Implicit sync points exist at the end of:
        –   parallel, for, do, sections, single,
              WORKSHARE
●   Implicit barriers can be turned off with “nowait”
●   There is no barrier associated with:
        –   critical, atomic, master
●   Explicit barriers must be used when ordering is required
    and otherwise not guaranteed
An explicit barrier in that Simple OpenMP Example
(The same program as above; note the explicit #pragma omp barrier /
!$omp barrier between the per-thread greeting and the tid == 0
block.)

Output using 4 threads:

 hi, from 3
 hi, from 0
 hi, from 2
 hi, from 1
 <barrier>
 4 threads say hi!
Trace of The Execution
[Diagram: the same trace as before, with the explicit
#pragma omp barrier marked: threads 0 through 3 print “hi, from N”,
all wait at the barrier (B), and only thread 0 (0 == 0) prints
“4 threads say hi!” before the join (J).

B = wait for all threads @ barrier before progressing further.]
The reduction Clause
●   Supported by the parallel and worksharing constructs
        –   parallel, for, do, sections
●   Creates a private copy of a shared var for each thread
●   At the end of the construct containing the reduction
    clause, all private values are reduced into one using the
    specified operator or intrinsic function

           #pragma omp parallel reduction(+:i)
           !$omp parallel reduction(+:i)
A reduction Example
C/C++

 #include <stdio.h>
 #include <omp.h>

 int main (int argc, char *argv[]) {
   int t, i;
   i = 0;
 #pragma omp parallel private(t) reduction(+:i)
   {
      t = omp_get_thread_num();
      i = t + 1;
      printf("hi, from %d\n", t);
 #pragma omp barrier
      if ( t == 0 ) {
        int numt = omp_get_num_threads();
        printf("%d threads say hi!\n", numt);
      }
   }
   printf("i is reduced to %d\n", i);
   return 0;
 }

F90

 program hello90
   use omp_lib
   integer :: t, i, numt
   i = 0
 !$omp parallel private(t) reduction(+:i)
   t = omp_get_thread_num()
   i = t + 1
   write (*,*) 'hi, from', t
 !$omp barrier
   if ( t == 0 ) then
     numt = omp_get_num_threads()
     write (*,*) numt, 'threads say hi!'
   end if
 !$omp end parallel
   write (*,*) 'i is reduced to ', i
 end program

Output using 4 threads:

 hi, from 3
 hi, from 0
 hi, from 2
 hi, from 1
 4 threads say hi!
 i is reduced to 10
Trace of a variable reduction
[Diagram: a shared variable i is initialized to 0 before the fork (F).
Each thread writes to its own private copy: thread 0 computes
i0 = 0 + 1, thread 1 computes i1 = 1 + 1, thread 2 computes
i2 = 2 + 1, and thread 3 computes i3 = 3 + 1. The reduction occurs at
the end of the construct: i = i + i0 + i1 + i2 + i3
= 0 + 1 + 2 + 3 + 4 = 10. The final value of i is available after the
end of the construct, at the join (J).]
Valid Operations for the reduction Clause
●   Reduction operations in C/C++:
          –   Arithmetic: + - *
          –   Bitwise:    & ^ |
          –   Logical:    && ||
●   Reduction operations in Fortran:
          –   Equivalent arithmetic, bitwise, and logical operations
          –   min, max
●   User-defined reductions (UDRs) are an area of current research
●   Note: the initial value of the reduction variable matters!
Nested parallel Constructs
●   The Specification makes this feature optional
        –   OMP_NESTED={true,false}
        –   OMP_MAX_ACTIVE_LEVELS={1,2,..}
        –   omp_{get,set}_nested()
        –   omp_get_level()
        –   omp_get_ancestor_thread_num(level)
●   Each encountering thread becomes the master of
    the newly forked team
●   Each subteam is numbered 0 through N-1
●   Useful, but still incurs parallel overheads
The Uniqueness of Thread Numbers in Nesting
Thread numbers are not unique; paths to each thread are.

[Diagram: at nest level 0, thread 0 calls __ompc_fork(..). Its level-1
team is numbered 0 through 2, and each of those threads calls
__ompc_fork(..) again; each level-2 subteam is again numbered 0
through 2. Two leaves are labeled with their path tuples, <0,2,2> and
<0,0,2>.]

Because of the tree structure, each thread can be uniquely
identified by its full path from the root of its sub-tree.

This path tuple can be calculated in O(level) using omp_get_level and
omp_get_ancestor_thread_num in combination (see the sketch below).
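A minimal sketch (not from the slides) that prints the calling
thread's path tuple, including the level-0 root:

 #include <stdio.h>
 #include <omp.h>

 /* Print this thread's path tuple, e.g. <0,2,2>. */
 void print_path(void) {
   int l, level = omp_get_level();
   printf("<");
   for (l = 0; l <= level; l++)   /* the ancestor at level 0 is always thread 0 */
     printf("%d%c", omp_get_ancestor_thread_num(l),
            l < level ? ',' : '>');
   printf("\n");
 }

 int main(void) {
   omp_set_nested(1);             /* enable nested parallelism */
 #pragma omp parallel num_threads(3)
 #pragma omp parallel num_threads(3)
   print_path();                  /* 9 tuples; output lines may interleave */
   return 0;
 }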
Work Sharing Constructs
    ●   Threads “share” work, access data in shared memory
    ●   OpenMP provides “work sharing” constructs
    ●   These specify the work to be distributed among the
        threads and state how the work is assigned to them
    ●   Data race conditions are illegal
    ●   They include:
            –   loops (for, DO)
            –   sections
            –   WORKSHARE (Fortran only)
            –   single, master
The Loop Constructs: for & DO
●   The loop constructs distribute iterations among threads
    according to some schedule (the default is static)
●   Among the first constructs used when introducing OpenMP
●   The clauses supported by the loop constructs are:
    private, firstprivate, lastprivate, reduction,
    schedule, ordered, collapse, nowait
●   The loop's schedule refers to the runtime policy used to
    distribute work among the threads.
OpenMP Parallelizes Loops by Distributing Iterations to Each Thread
 int i;
 #pragma omp for
   for (i=0; i <= 99; i++) {
     // do stuff
   }

[Diagram: with 3 threads, the iterations are split statically. Thread
0 runs i = 0 through 33, thread 1 runs i = 34 through 67, and thread
2 runs i = 68 through 99, each executing its own copy of the loop
body.]
Parallelizing Loops - C/C++
 #include <stdio.h>
 #include <omp.h>
 #define N 100

 int main(void)
 {
   float a[N], b[N], c[N];
   int i;
   omp_set_dynamic(0);          // disable dynamic adjustment of the team size
   omp_set_num_threads(20);     // request 20 threads
 /* Initialize arrays a and b. */
   for (i = 0; i < N; i++)
     {
       a[i] = i * 1.0;
       b[i] = i * 2.0;
     }
 /* Compute values of array c in parallel. */
 #pragma omp parallel shared(a, b, c) private(i)
    {
 #pragma omp for nowait          /* the nowait clause is optional */
     for (i = 0; i < N; i++)
       c[i] = a[i] + b[i];
     }
   printf ("%f\n", c[10]);
 }

http://developers.sun.com/solaris/articles/studio_openmp.html
Parallelizing Loops - Fortran
       PROGRAM VECTOR_ADD
       USE OMP_LIB
       INTEGER N, I
       PARAMETER (N=100)
       REAL A(N), B(N), C(N)
       CALL OMP_SET_DYNAMIC (.FALSE.)  !disable dynamic adjustment of the team size
       CALL OMP_SET_NUM_THREADS (20)   !request 20 threads
 ! Initialize arrays A and B.
       DO I = 1, N
          A(I) = I * 1.0
          B(I) = I * 2.0
       ENDDO
 ! Compute values of array C in parallel.
 !$OMP PARALLEL SHARED(A, B, C), PRIVATE(I)
 !$OMP DO
       DO I = 1, N
          C(I) = A(I) + B(I)
       ENDDO
 ! The NOWAIT clause is optional & goes on the closing directive.
 !$OMP END DO NOWAIT
       ! ... some more instructions
 !$OMP END PARALLEL
       PRINT *, C(10)
       END

http://developers.sun.com/solaris/articles/studio_openmp.html
Parallel Loop Scheduling
●   The schedule is a description that states which loop
    iterations are assigned to a particular thread
●   There are 5 types:
           –   static – each thread calculates its own chunk
           –   dynamic – “first come, first served”, managed by the runtime
           –   guided – dynamic, with decreasing chunk sizes
           –   auto – determined automatically by the compiler or runtime
           –   runtime – defined by OMP_SCHEDULE or omp_set_schedule
●   Limitations
     –   only one schedule type may be used for a given loop
     –   the chunk size applies to all threads
Parallel Loop Scheduling - Example
Fortran

 !$OMP PARALLEL SHARED(A, B, C) PRIVATE(I)
 !$OMP DO SCHEDULE (DYNAMIC,4)
       DO I = 1, N
         C(I) = A(I) + B(I)
       ENDDO
 !$OMP END DO
 !$OMP END PARALLEL

(In SCHEDULE (DYNAMIC,4), DYNAMIC is the schedule and 4 is the chunk
size.)

C/C++

 #pragma omp parallel shared(a, b, c) private(i)
  {
 #pragma omp for schedule (guided,4)
    for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  }
The ordered Clause and ordered construct
●   An ordered loop contains code that must execute in
    serial order
●   No clauses are supported
●   The ordered code must be inside an ordered construct

         #pragma omp parallel shared(a, b, c) private(i)
           {
         #pragma omp for ordered              /* ordered clause */
             for (i = 0; i <= 99; i++) {
               // do a lot of stuff concurrently
         #pragma omp ordered                  /* ordered construct */
               {
                 a = i * (b + c);
                 b = i * (a + c);
                 c = i * (a + b);
               }
             }
           }
The collapse Clause
●   n loop levels are collapsed into one combined iteration space
●   The schedule and ordered clauses are applied to the combined
    space as expected
●   E.g.,

          #pragma omp parallel shared(a, b, c) private(i, j)
            {
          #pragma omp for schedule(dynamic,4) collapse(2)
              for (i = 0; i <= 2; i++) {
                for (j = 0; j <= 2; j++) {
                  // do stuff for each i,j
                }
              }
            }

●   Iterations are distributed as (i*j) 2-tuples, e.g., <i,j>:
            –   <0,0>; <0,1>; <0,2>; <1,0>; <1,1>; <1,2>; <2,0>;
                  <2,1>; <2,2>
The WORKSHARE Construct (Fortran Only)
●   Provides for parallel execution of code using F90 array
    syntax
●   The clauses supported by the WORKSHARE construct are:
    private, firstprivate, copyprivate, nowait
●   There is an implicit barrier at the end of this construct
●   Valid Fortran code enclosed in a WORKSHARE construct:
        –   Array & scalar variable assignments
        –   FORALL statements & constructs
        –   WHERE statements & constructs
        –   User-defined functions of type ELEMENTAL
        –   OpenMP atomic, critical, & parallel constructs
The sections Construct
●   The sections construct defines blocks of code, each of which
    is executed exactly once by one of the threads
●   A barrier is implied at the end of the construct
●   Supported clauses include: private, firstprivate,
    lastprivate, reduction, nowait
A section Construct Example
 #include <stdio.h>
 #include <omp.h>
 int square(int n){
    return n*n;
 }
 int main(void){
   int x, y, z, xs, ys, zs;
   omp_set_dynamic(0);
   omp_set_num_threads(3);
   x = 2; y = 3; z = 5;

 #pragma omp parallel shared(xs,ys,zs)
   {
 #pragma omp sections
      {
 #pragma omp section
        { xs = square(x);
          printf ("id = %d, xs = %d\n", omp_get_thread_num(), xs);
        }
 #pragma omp section
        { ys = square(y);
          printf ("id = %d, ys = %d\n", omp_get_thread_num(), ys);
        }
 #pragma omp section
        { zs = square(z);
          printf ("id = %d, zs = %d\n", omp_get_thread_num(), zs);
        }
      }
   }
   return 0;
 }

http://developers.sun.com/solaris/articles/studio_openmp.html
A section Construct Example
(The same sections block as above.)

[Diagram: timeline from t=0 to t=1; thread 0, thread 1, and thread 2
each execute one section concurrently.]
Combined parallel Constructs
●   parallel may be combined with the following:
        –   for, do, sections, WORKSHARE
●   Semantics are identical to the usage already discussed

 !$OMP PARALLEL DO SHARED(A, B, C) PRIVATE(I)
 !$OMP& SCHEDULE(DYNAMIC,4)
       DO I = 1, N
         C(I) = A(I) + B(I)
       ENDDO
 !$OMP END PARALLEL DO

 #pragma omp parallel for shared(a, b, c) private(i) schedule (guided,4)
    for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];
Singling Out Threads with master and single Constructs
●    Code inside a master construct will only be executed by the master
     thread.
●    There is NO implicit barrier associated with master; other threads ignore it.

                          !$OMP MASTER
                            … do stuff
                          !$OMP END MASTER

●    Code inside a single construct will be executed by the first thread to
     encounter it.
●    A single construct contains an implicit barrier that respects nowait.
     (A sketch using copyprivate follows.)

            C/C++         #pragma omp single [nowait] [copyprivate(list)]
                              {
                                … do stuff
                              }

            Fortran       !$OMP SINGLE
                            … do stuff
                          !$OMP END SINGLE [nowait] [copyprivate(list)]

            Note the position of the clauses in Fortran: they go on the
            END SINGLE directive.
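A minimal sketch (not from the slides) of single with copyprivate:
one thread produces a value, and copyprivate broadcasts it to every
thread's private copy before any thread continues.

 #include <stdio.h>
 #include <omp.h>

 int main(void) {
   int seed;
 #pragma omp parallel private(seed)
   {
     /* Exactly one thread executes this block ... */
 #pragma omp single copyprivate(seed)
     seed = 12345;
     /* ... and every thread's private seed now holds 12345. */
     printf("thread %d: seed = %d\n",
            omp_get_thread_num(), seed);
   }
   return 0;
 }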
The task Construct
●   Tasks were added in 3.0 to handle dynamic and
    unstructured applications
        –   Recursion
        –   Tree & graph traversals
●   OpenMP's execution model based on threads was
    redefined
●   A thread is considered to be an implicit task
●   The task construct defines singular tasks explicitly
●   Less overhead than nested parallel regions
Threads are now Implicit Tasks
[Diagram: the same four-CPU system; the four threads of process 0 are
now labeled Implicit task 0 through Implicit task 3.]
Each Thread Conceptually Has Both a tied & untied queue
[Diagram: each implicit task (0 through 3) conceptually has two task
queues, a tied (private) queue and an untied (public) queue.]
The task Construct
●   Clauses supported are: if, default, private,
    firstprivate, shared, tied/untied
●   By default, all variables are firstprivate
●   Tasks can be nested syntactically, but are still
    asynchronous
●   The taskwait directive causes a task to wait until all its
    children have completed
A task Construct C/C++ Example

  struct node {
     struct node *left;
     struct node *right;
  };

  extern void process(struct node *);

  void traverse( struct node *p ) {
    if (p->left)
  #pragma omp task // p is firstprivate by default
      traverse(p->left);
    if (p->right)
  #pragma omp task // p is firstprivate by default
      traverse(p->right);
    process(p);
  }
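The slide omits the call site; here is a minimal driver sketch
(assumed, not from the deck): one thread seeds the traversal inside
single, and the whole team executes the generated tasks.

 #include <stdio.h>
 #include <omp.h>

 struct node {
    struct node *left;
    struct node *right;
 };

 void process(struct node *p) {
   printf("thread %d processed %p\n",
          omp_get_thread_num(), (void *)p);
 }

 void traverse(struct node *p) {
   if (p->left)
 #pragma omp task                 /* p is firstprivate by default */
     traverse(p->left);
   if (p->right)
 #pragma omp task
     traverse(p->right);
   process(p);
 }

 int main(void) {
   /* A 3-node tree: a root with two leaves. */
   struct node l = { NULL, NULL }, r = { NULL, NULL };
   struct node root = { &l, &r };

 #pragma omp parallel             /* the team that runs the tasks */
 #pragma omp single               /* one thread creates the root task tree */
   traverse(&root);
   return 0;
 }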
A task Construct Fortran Example

        RECURSIVE SUBROUTINE traverse ( P )
          TYPE Node
            TYPE(Node), POINTER :: left, right
          END TYPE Node
          TYPE(Node) :: P
          IF (associated(P%left)) THEN
  !$OMP TASK ! P is firstprivate by default
            call traverse(P%left)
  !$OMP END TASK
          ENDIF
          IF (associated(P%right)) THEN
  !$OMP TASK ! P is firstprivate by default
            call traverse(P%right)
  !$OMP END TASK
          ENDIF
          CALL process ( P )
  END SUBROUTINE
Mutual Exclusion: critical, atomic, and omp_lock_t
●   Some code must be executed by one thread at a time
●   Effectively serializes the threads' execution of that code
●   Such code regions are often called critical sections
●   OpenMP provides 3 ways to achieve mutual exclusion
        –   The critical construct encloses a critical
              section
        –   The atomic construct encloses updates to shared
              variables
        –   A low-level, general-purpose locking mechanism
              (omp_lock_t)
The critical Construct
                                                              A Guide to OpenMP




    ●   The critical construct encloses code that should be
        executed by all threads, just in some serial order
                    #pragma omp parallel
                      {
                    #pragma omp critical
                        {
                          // some code
                        }
                      }

    ●   The effect is equivalent to a lock protecting the code
A critical Construct Example
                                                                       A Guide to OpenMP




                     #pragma omp parallel shared(a, b, c) private(i)
                       {
                     #pragma omp critical
                         {
                           //
                           // do stuff (one thread at a time)
                           //
                         }
                       }




                           thread 1      thread 0        thread 2



                          t=0         t=1             t=2

                                         Time
  Note:
  Encountering thread
  order not guaranteed!
The Named critical Construct
                                                              A Guide to OpenMP




   ●   Names may be applied to critical constructs.
                   #pragma omp parallel
                     {
                   #pragma omp critical(a)
                       {
                         // some code
                       }
                   #pragma omp critical(b)
                       {
                         // some code
                       }
                   #pragma omp critical(c)
                       {
                         // some code
                       }
                     }

    ●   The effect is equivalent to using a different lock for
        each section.
A Named critical Construct Example
                                                    A Guide to OpenMP
  #include <stdio.h>
  #include <omp.h>
  #define N 100

  int main(void)
  { float a[N], b[N], c[3];
    int i;
    /* Initialize arrays a and b; zero the accumulators in c. */
      for (i = 0; i < N; i++)
         { a[i] = i * 1.0 + 1.0;
            b[i] = i * 2.0 + 2.0;
         }
      c[0] = c[1] = c[2] = 0.0;
    /* Compute values of array c in parallel. */
  #pragma omp parallel shared(a, b, c) private(i)
    {
  #pragma omp critical(a)
       { for (i = 0; i < N; i++)
               c[0] += a[i] + b[i];
           printf("%f\n", c[0]);
       }
  #pragma omp critical(b)
       { for (i = 0; i < N; i++)
               c[1] += a[i] + b[i];
           printf("%f\n", c[1]);
       }
  #pragma omp critical(c)
       { for (i = 0; i < N; i++)
              c[2] += a[i] + b[i];
           printf("%f\n", c[2]);
       }
    }
  }
A Named critical Construct Example
                                                                              A Guide to OpenMP




   #pragma omp critical(a)
       {
         // some code
       }
   #pragma omp critical(b)
       {                                   Note:
         // some code                      Encountering thread
       }                                   order not guaranteed!
   #pragma omp critical(c)
       {
         // some code
       }



                   thread 1   thread 0   thread 2


                              thread 1   thread 0    thread 2

                                         thread 1    thread 0      thread 2


                  t=0        t=1        t=2         t=3           t=4

                               Time
The atomic Construct for Safely Updating Shared Variables
                                                                         A Guide to OpenMP




    ●     Protected writes to shared variables
    ●     Typically lighter weight than a critical construct
                           #include <stdio.h>
                           #include <omp.h>

                           int main(void) {
                              int count = 0;
                           #pragma omp parallel shared(count)
                              {
                             #pragma omp atomic
                                  count++;
                              }
                              printf("Number of threads: %dn",count);
                           }

        Note:
        Encountering thread
        order not guaranteed!
Locks in OpenMP
                                                                    A Guide to OpenMP




    ●   omp_lock_t, omp_lock_kind
    ●   omp_nest_lock_t, omp_nest_lock_kind
    ●   Threads set/unset locks
    ●   Nested locks can be set multiple times by the same
        thread before releasing them
    ●   More flexible than critical and atomic constructs –
        the behavior of both is defined by the behavior of locks
            –   E.g., “named” critical sections behave as if each
                  critical section used a different lock variable
    ●   Locking routines imply a memory flush
Simple Locks vs Nested Locks
                                                            A Guide to OpenMP




    ●   A simple lock may be set only once before it is unset
    ●   A simple lock must be unset by the owning thread, and
        only once
    ●   A nested lock may be set multiple times, but only by the
        thread that initially set it
    ●   A nested lock must be unset once for each time it is set
    ●   A nested lock is not released until its lock count has
        been decremented to 0 (see the sketch below)
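  A minimal sketch of the nested-lock counting rule; the recursive
  routine is hypothetical:

  #include <stdio.h>
  #include <omp.h>

  omp_nest_lock_t lck;

  /* each recursion level sets the nested lock once more; other
     threads acquire it only after the count returns to 0 */
  void countdown(int n) {
    omp_set_nest_lock(&lck);     /* lock count increments */
    if (n > 0) countdown(n - 1); /* same thread may re-acquire */
    printf("level %d still holds the lock\n", n);
    omp_unset_nest_lock(&lck);   /* lock count decrements; freed at 0 */
  }

  int main(void) {
    omp_init_nest_lock(&lck);
  #pragma omp parallel num_threads(2)
    countdown(2);
    omp_destroy_nest_lock(&lck);
    return 0;
  }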
Using Locks in OpenMP
                                                              A Guide to OpenMP

  #include <stdlib.h>
  #include <stdio.h>
  #include <omp.h>

  /*
       this example mimics the behavior of the atomic
       construct used to safely update shared variables
  */

  int main()
  {
    int x;
    omp_lock_t lck;
    /* master thread initializes lock var before fork */
    omp_init_lock(&lck);
    x = 0;
  #pragma omp parallel shared (x,lck)
    {
      omp_set_lock(&lck);   /* threads wait to acquire lock */
         x = x+1;           /* thread holding lock updates x */
      omp_unset_lock(&lck); /* thread releases lock */
    }
    /* master thread destroys lock after join */
    omp_destroy_lock (&lck);
  }
Using Nested Locks in OpenMP
                                                         A Guide to OpenMP
  #include <omp.h>
  typedef struct {
    int a,b; omp_nest_lock_t lck; } pair;
  int work1();
  int work2();
  int work3();
  void incr_a(pair *p, int a) {
    /* Called only from incr_pair, no need to lock. */
    p->a += a;
  }
  void incr_b(pair *p, int b) {
    /* Called both from incr_pair and elsewhere, */
    /* so need a nestable lock. */
    omp_set_nest_lock(&p->lck);
    p->b += b;
    omp_unset_nest_lock(&p->lck);
  }
  void incr_pair(pair *p, int a, int b) {
    omp_set_nest_lock(&p->lck);
    incr_a(p, a);
    incr_b(p, b);
    omp_unset_nest_lock(&p->lck);
  }
  void a45(pair *p) {
  #pragma omp parallel sections
    {
  #pragma omp section
      incr_pair(p, work1(), work2());
  #pragma omp section
      incr_b(p, work3());
    }
  }
Fortran Programming Tips
                                                                           A Guide to OpenMP




    ●   Directives are case insensitive
    ●   In fixed-form Fortran, OpenMP directives can hide behind
        the following “sentinels”
                !$[OMP], c$[OMP], *$[OMP]
    ●   Free form requires “!$”
    ●   Sentinels can enable conditional compilation
                !$ call omp_set_num_threads(n)
    ●   Fixed-form sentinels must start in column 1
    ●   Long directive continuations take a form similar to:
        !$OMP PARALLEL DEFAULT(NONE)
        !$OMP& SHARED(INP,OUTP,BOXL,TEMP,RHO,NSTEP,TSTEP,X,Y,Z,VX,VY,VZ,BOXL)
        !$OMP& SHARED(XO,YO,ZO,TSTEP,V2T,VXT,VYT,VZT,IPRINT,ISTEP,ETOT,ERUN)
        !$OMP& SHARED(FX,FY,FZ,PENER)
        !$OMP& PRIVATE(I)
C/C++ Programming Tips
                                                                            A Guide to OpenMP




    ●   A directive occupies a single logical line (long directives
        can be continued with a trailing “\”)
    ●   Directives are case sensitive
    ●   No conditional compilation sentinels; use “#ifdef _OPENMP”,
        etc. (see the sketch after this example)
    ●   Code Style:   int main (...) {
                        ...
                      #pragma omp parallel
all #pragmas in         {
                      #pragma omp sections
col. 0                    {
                      #pragma omp section
                            { xs = square(x);
braces indented               printf ("id = %d, xs = %d\n", omp_get_thread_num(), xs);
as usual                    }
                      #pragma omp section
                            { ys = square(y);
                              printf ("id = %d, ys = %d\n", omp_get_thread_num(), ys);
                            }
                          }
                        }
                        return 0; /* end main */
                      }
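  A minimal sketch of conditional compilation via the _OPENMP macro,
  which conforming compilers define when OpenMP is enabled:

  #include <stdio.h>
  #ifdef _OPENMP
  #include <omp.h>  /* only needed when compiled with OpenMP support */
  #endif

  int main(void) {
  #ifdef _OPENMP
    /* runtime calls are guarded so the file still builds serially */
    printf("OpenMP enabled, max threads = %d\n", omp_get_max_threads());
  #else
    printf("compiled without OpenMP\n");
  #endif
    return 0;
  }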
General Programming Tips
                                                                    A Guide to OpenMP




    ●   Minimize parallel constructs
    ●   Use combined constructs, if doing so doesn't violate the above
    ●   Minimize shared variables, maximize private ones
    ●   Minimize barriers, but don't sacrifice safety
    ●   When inserting OpenMP into existing code
            –   Use a disciplined, iterative cycle – test against the
                  serial version (a minimal sketch follows)
            –   Use barriers liberally
            –   Optimize the OpenMP & relax synchronization last
    ●   When starting from scratch
            –   Start with an optimized serial version
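  A minimal sketch of testing an OpenMP version against its serial
  counterpart while timing both; the workload and problem size here are
  arbitrary:

  #include <stdio.h>
  #include <math.h>
  #include <omp.h>

  #define N 1000000

  int main(void) {
    double serial = 0.0, par = 0.0, t;
    int i;

    /* serial reference version */
    t = omp_get_wtime();
    for (i = 0; i < N; i++) serial += sin(i * 1e-6);
    printf("serial:   %f (%f s)\n", serial, omp_get_wtime() - t);

    /* OpenMP version, checked against the reference */
    t = omp_get_wtime();
  #pragma omp parallel for reduction(+:par)
    for (i = 0; i < N; i++) par += sin(i * 1e-6);
    printf("parallel: %f (%f s)\n", par, omp_get_wtime() - t);

    printf("difference: %g\n", fabs(serial - par));
    return 0;
  }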
The Future of OpenMP
                                                                A Guide to OpenMP




    ●   Vendor buy-in and R&D support is as strong as ever
    ●   Must remain relevant
    ●   Active areas of research:
             –   Refinements to the tasking model (scheduling, etc.)
             –   User defined reductions (UDRs)
             –   Accelerators & heterogeneous environments
             –   Error handling
             –   Hybrid models
    ●   Scaling issues being addressed:
             –   Thousands of threads
             –   Data locality
             –   More flexible & efficient synchronization
A Practical Example – Calculating   π
                                                                     A Guide to OpenMP

  static long num_steps = 100000;
  double step;
  void main ()
  {     int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        for (i=0;i< num_steps; i++){
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
  }

  Mathematically, we know that ∫₀¹ 4/(1+x²) dx = π, and this integral
  can be approximated as a sum of the areas of rectangles, Σᵢ F(xᵢ)·Δx,
  where each rectangle has a width of Δx and a height of F(xᵢ) taken at
  the middle of interval i.
A Naïve Parallel Program - Calculating   π
                                                                            A Guide to OpenMP

  #include <omp.h>                                          Promote scalar to an array
  static long num_steps = 100000; double step;              dimensioned by number of
  #define NUM_THREADS 2                                     threads to avoid race
  void main ()                                              condition
  {
    int i, id, nthreads; double x, pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);                       Can't assume the number
  #pragma omp parallel private (i, id, x)                   of threads requested; use
    {                                                       single to prevent conflicts.
      id = omp_get_thread_num();
  #pragma omp single
        nthreads = omp_get_num_threads();
      for (i=id, sum[id]=0.0;i< num_steps; i=i+nthreads){   False sharing of the sum
        x = (i+0.5)*step;                                   array will affect
                                                            performance
        sum[id] += 4.0/(1.0+x*x);
      }
    }
    for(i=0, pi=0.0;i<nthreads;i++)pi += sum[i] * step;
  }
A Slightly More Efficient Parallel Program - Calculating   π
                                                                              A Guide to OpenMP

  #include <omp.h>
  static long num_steps = 100000; double step;
  #define NUM_THREADS 2
  void main ()
  {
    int i, id, nthreads; double x, pi, sum;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel private (i, id, x, sum)
    {
      id = omp_get_thread_num();
  #pragma omp single
        nthreads = omp_get_num_threads();
      for (i=id, sum=0.0;i< num_steps; i=i+nthreads){
                                                             No array, so no false sharing
        x = (i+0.5)*step;
        sum += 4.0/(1.0+x*x);
      }
  #pragma omp critical
      pi += sum * step;                                      Note: this method of
                                                             combining partial sums
    }                                                        doesn't scale very well.
  }
A Simple & Efcient Parallel Program - Calculating         π
                                                                              A Guide to OpenMP

   #include <omp.h>
   static long num_steps = 100000; double step;
   #define NUM_THREADS 2                                       For good OpenMP
   void main ()                                                implementations, reduction
   {                                                           is more scalable than a
     int i; double x, pi, sum = 0.0;                           critical construct
     step = 1.0/(double) num_steps;
     omp_set_num_threads(NUM_THREADS);
   #pragma omp parallel for private(x) reduction(+:sum)
     for (i=0;i<= num_steps; i++){
       x = (i+0.5)*step;
       sum = sum + 4.0/(1.0+x*x);
     }
     pi = step * sum;
   }

  i private
  by default




              Note: a parallel program is created without changing
              any existing code, just by adding 4 simple lines.
The Final Parallel Program - Calculating    π
                                                                                A Guide to OpenMP

   #include <omp.h>
   static long num_steps = 100000; double step;
   #define NUM_THREADS 2                                         For good OpenMP
   void main ()                                                  implementations, reduction
   {                                                             is more scalable than a
     int i; double x, pi, sum = 0.0;                             critical construct
     step = 1.0/(double) num_steps;
     omp_set_num_threads(NUM_THREADS);
   #pragma omp parallel for private(x) reduction(+:sum)
     for (i=0;i<= num_steps; i++){
       x = (i+0.5)*step;
       sum = sum + 4.0/(1.0+x*x);
     }
     pi = step * sum;
   }

  i private
  by default




              In practice, the number of threads is set using the
              environment variable OMP_NUM_THREADS (e.g., OMP_NUM_THREADS=4 ./a.out).
Additional Resources
                                                           A Guide to OpenMP




    ●   http://www.cs.uh.edu/~hpctools
    ●   http://www.compunity.org
    ●   http://www.openmp.org
            –   Specification 3.0
    ●   “Using OpenMP”, Chapman et al.




                                           Covers the specification through 2.5
Tutorial Contributors
                                                                           A Guide to OpenMP




    ●   Brett Estrade
    ●   Dr. Barbara Chapman
    ●   Amrita Banerjee


           The latest version of this presentation may be downloaded at:

                   http://www.cs.uh.edu/~hpctools/OpenMP

     All feedback is welcome and appreciated. Please send all feedback to Brett:

                               estrabd@cs.uh.edu

More Related Content

Viewers also liked

Viewers also liked (8)

Presentation on Shared Memory Parallel Programming
Presentation on Shared Memory Parallel ProgrammingPresentation on Shared Memory Parallel Programming
Presentation on Shared Memory Parallel Programming
 
Biref Introduction to OpenMP
Biref Introduction to OpenMPBiref Introduction to OpenMP
Biref Introduction to OpenMP
 
OpenMP
OpenMPOpenMP
OpenMP
 
Open mp intro_01
Open mp intro_01Open mp intro_01
Open mp intro_01
 
Openmp
OpenmpOpenmp
Openmp
 
OpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersOpenMP Tutorial for Beginners
OpenMP Tutorial for Beginners
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMP
 
OpenMP
OpenMPOpenMP
OpenMP
 

Similar to A Guide to OpenMP in 38 Characters

lec5 - The processor.pptx
lec5 - The processor.pptxlec5 - The processor.pptx
lec5 - The processor.pptxMahadevaAH
 
[COSCUP 2018] uTensor C++ Code Generator
[COSCUP 2018] uTensor C++ Code Generator[COSCUP 2018] uTensor C++ Code Generator
[COSCUP 2018] uTensor C++ Code GeneratorYin-Chen Liao
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecturePiyush Mittal
 
Multi-core architectures
Multi-core architecturesMulti-core architectures
Multi-core architecturesnextlib
 
How to Diagnose Problems Quickly on Linux Servers
How to Diagnose Problems Quickly on Linux ServersHow to Diagnose Problems Quickly on Linux Servers
How to Diagnose Problems Quickly on Linux ServersRichard Cunningham
 

Similar to A Guide to OpenMP in 38 Characters (7)

lec5 - The processor.pptx
lec5 - The processor.pptxlec5 - The processor.pptx
lec5 - The processor.pptx
 
[COSCUP 2018] uTensor C++ Code Generator
[COSCUP 2018] uTensor C++ Code Generator[COSCUP 2018] uTensor C++ Code Generator
[COSCUP 2018] uTensor C++ Code Generator
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecture
 
Multi-core architectures
Multi-core architecturesMulti-core architectures
Multi-core architectures
 
The Spectre of Meltdowns
The Spectre of MeltdownsThe Spectre of Meltdowns
The Spectre of Meltdowns
 
How to Diagnose Problems Quickly on Linux Servers
How to Diagnose Problems Quickly on Linux ServersHow to Diagnose Problems Quickly on Linux Servers
How to Diagnose Problems Quickly on Linux Servers
 

Recently uploaded

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Recently uploaded (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

A Guide to OpenMP in 38 Characters

  • 1. A Guide To OpenMP A Guide to OpenMP Presented By: Dr. Barbara Chapman Created By: Brett Estrade HPCTools Group University of Houston Department of Computer Science http://www.cs.uh.edu/~hpctools
  • 2. An Example Shared Memory System & Cache Hierarchy A Guide to OpenMP cpu 0 cpu 2 L1 L1 L2 Main Memory L2 L1 L1 cpu 1 cpu 3
  • 3. Multiple Processes on this System A Guide to OpenMP cpu 0 process 0 p0 cpu 1 process 1 p1 Main Memory p2 cpu 2 process 2 p3 cpu 3 process 3
  • 4. Multiple Processes on this System – this is MPI's model A Guide to OpenMP cpu 0 process 0 p0 recv send cpu 1 process 1 p1 recv send p2 cpu 2 process 2 p3 recv send cpu 3 process 3
  • 5. A Single Process with Multiple Threads on this System A Guide to OpenMP cpu 0 Thread 0 p0 cpu 1 Thread 1 Main Memory process 0 cpu 2 Thread 2 cpu 3 process 33 Thread
  • 6. OpenMP's Execution Model – Fork & Join A Guide to OpenMP Main Memory t0 cpu 0 Do stuff t0 cpu 0 cpu 1 Do stuff t1 t1 t0 F J t0 cpu 2 process 0 t2 Do stuff t2 cpu 3 t3 Do stuff t3 Here, thread 0 is on CPU 0; at the fork, 3 new threads are created and are distributed to the remaining 3 CPUs.
  • 7. An Example of Sharing Memory A Guide to OpenMP cpu 0 Thread 0 p0 int a cpu 1 Thread 1 Main Memory process 0 cpu 2 Thread 2 cpu 3 process 33 Thread
  • 8. An Example of Sharing Memory A Guide to OpenMP cpu 0 Thread 0 write p0 int a cpu 1 Thread 1 Main Memory process 0 cpu 2 Thread 2 cpu 3 process 33 Thread Thread 0 writes integer a.
  • 9. An Example of Sharing Memory A Guide to OpenMP cpu 0 Thread 0 write p0 int a read cpu 1 Thread 1 Main Memory process 0 cpu 2 Thread 2 cpu 3 process 33 Thread Thread 1 reads integer a.
  • 10. An Example of Sharing Memory A Guide to OpenMP cpu 0 Thread 0 write p0 int a read cpu 1 Thread 1 read Main Memory process 0 cpu 2 Thread 2 cpu 3 process 33 Thread Thread 2 reads integer a.
  • 11. An Example of Sharing Memory A Guide to OpenMP cpu 0 Thread 0 write p0 int a read cpu 1 Thread 1 read Main Memory process 0 cpu 2 Thread 2 read cpu 3 process 33 Thread Thread 3 reads integer a.
  • 12. Notes on the Example of Sharing Memory A Guide to OpenMP ● Threads execute their code independently ● So usually Read/Write order matters ● If the order is not specifed in some way, this could represent a (data) race condition ● Race conditions introduce non-determinism (not good) ● Threaded programs can be extremely difcult to debug ● Proper precautions must be made to eliminate data races
  • 13. What is OpenMP? A Guide to OpenMP ● An industry standard for shared memory parallel programming – OpenMP Architecture Review Board – AMD, Intel, IBM, HP, Microsoft, Sun/Oracle, Fujitsu, NEC, Texas Instruments, PGI, CAPS; LLNL, ORNL, ANL, NASA, cOMPunity,.. – http://www.openmp.org/ ● A set of directives for describing parallelism in an application code ● A user-level API and runtime environment ● A widely supported standard set of parallel programming pragmas with bindings for Fortran, C, & C++ ● A community of active users & researchers
  • 14. The Anatomy of an OpenMP Program A Guide to OpenMP #include <stdio.h> #include <omp.h> parallel (fork) int main (int argc, char *argv[]) { runtime function directive int tid, numt; numt = omp_get_num_threads(); #pragma omp parallel private(tid) shared(numt) { tid = omp_get_thread_num(); printf("hi, from %dn", tid); clauses structured #pragma omp barrier if ( tid == 0 ) { directive parallel block printf("%d threads say hi!n",numt); } (thread barrier) } return 0; }
  • 15. OpenMP is Really a Standard Specifcation A Guide to OpenMP The timeline of the OpenMP Standard Specifcation
  • 16. OpenMP vs MPI A Guide to OpenMP There is no silver bullet And that makes Teen Wolf happy
  • 17. Some Benefts of using OpenMP A Guide to OpenMP ● It's portable, supported by most C/C++ & Fortran compilers ● Much of serial code can be left untouched (most of the time) ● The development cycle is a friendly one – Can be introduced iteratively into existing code – Correctness can be verifed along the way – Likewise, performance benefts can be gauged ● Optimizing memory access in the serial program will beneft the threaded version (can help avoid false sharing, etc) ● It can be fun to use (immediate gratifcation)
  • 18. What Does OpenMP Provide? A Guide to OpenMP ● More abstract programming interface than low level thread libraries ● User specifes strategy for parallel execution, but not all the details ● Directives, written as structured comments ● A runtime library that manages execution dynamically ● Control via environment variables & the runtime API ● Expectations of behavior & sensible defaults ● A promise of interface portability;
  • 19. What Compilers Support OpenMP? A Guide to OpenMP Vendor Languages Supported Specification IBM C/C++(10.1),Fortran(13.1) Full 3.0 support Sun/Oracle C/C++,Fortran(12.1) Full 3.0 support Intel C/C++,Fortran(11.0) Full 3.0 support Portland Group C/C++,Fortran Full 3.0 support Absoft Fortran(11.0) Full 2.5 support Lahey/Fujitsu C/C++,Fortran(6.2) Full 2.0 support PathScale C/C++,Fortran Full 2.5 support (based on Open64) HP C/C++,Fortran Full 2.5 support Cray C/C++,Fortran Full 3.0 on Cray XT Series Linux GNU C/C++,Fortran Working towards full 3.0 Microsoft C/C++,Fortran Full 2.0
  • 20. OpenMP Research Activities A Guide to OpenMP ● A lot of research has gone into the OpenMP standard ● International Workshop on OpenMP (IWOMP) ● Suites: validation, NAS, SPEC, EPCC, BOTS ● Open Source Research Compilers: – OpenUH – NANOS – Rose/{OMNI,GCC} – MPC, etc – Commercial R&D ● cOMPunity - http://www.compunity.org ● Applications work with new features
  • 21. Compiling and Executing Examples A Guide to OpenMP ● IBM XL Suite: – xlc_r, xlf90, etc % xlc_r -qsmp=omp test.c -o test.x # compile it bash % OMP_NUM_THREADS=4 ./test.x # execute it ● OpenUH: ● uhcc, uhf90, etc % uhcc -mp test.c -o test.x # compile it bash % OMP_NUM_THREADS=4 ./test.x # execute it
  • 22. OpenMP Directives & Constructs A Guide to OpenMP ● Contained inside structured comments C/C++: #pragma omp <directive> <clauses> Fortran: !$OMP <directive> <clauses> ● OpenMP compliant compilers fnd and parse directives ● Non-compliant should safely ignore them as comments ● A construct is a directive that afects the enclosing code ● Imperative (standalone) directives exist ● Clauses control the behavior of directives ● Order of clauses has no bearing on efect
  • 23. Summary of OpenMP's Directives and Key Clauses A Guide to OpenMP ● Forking Threads ● Synchronization parallel barrier ● Distributing Work flush for (C/C++) ordered DO (Fortran) taskwait sections/section ● Asynchronous Tasking WORKSHARE (Fortran) task ● Singling Out Threads ● Data Environment single shared Master private threadprivate ● Mutual Exclusion reduction – critical – atomic
  • 24. The OpenMP “Runtime” Library (RTL) A Guide to OpenMP ● The “runtime” manages the multi-threaded execution: – It's used by the resulting executable OpenMP program – It's what spawns threads (e.g., calls pthreads) – It's what manages shared & private memory – It's what distributes (shares) work among threads – It's what synchronizes threads & tasks – It's what reduces variables and keeps lastprivate – It's what is infuenced by envars & the user level API ● Doxygen docs of OpenUH's OpenMP RTL, libopenmp ● The Doxygen call graph for __ompc_fork in libopenmp/omp_threads.c http://www2.cs.uh.edu/~estrabd/OpenUH/r593/html-libopenmp/
  • 25. _Call Graph of __ompc_fork in libopenmp/threads.c A Guide to OpenMP
  • 26. Useful OpenMP Environment Variables A Guide to OpenMP ● OMP_NUM_THREADS ● OMP_SCHEDULE ● OMP_DYNAMIC ● OMP_STACKSIZE ● OMP_NESTED ● OMP_THREAD_LIMIT ● OMP_MAX_ACTIVE_LEVELS ● OMP_WAIT_POLICY
  • 27. 3 Types of Runtime Functions A Guide to OpenMP Execution environment routines; e.g., – omp_{set,get}_num_threads – omp_{set,get}_dynamic – Each envar has a corresponding get/set Locking routines (generalized mutual exclusion); e.g., – omp_{init,set,test,unset,destroy}_lock – omp_{...}_nest_lock Timing routines; e.g., – omp_get_wtime – omp_get_wtick
  • 28. How Is OpenMP Implemented? A Guide to OpenMP ● A compiler translates OpenMP – It processes the directives and uses them to create explicitly multithreaded code – It translates the rest of the Fortran/C/C++ more or less as usual – Each compiler has its own strategy ● The generated code makes calls to the runtime library ● We have already seen what the runtime is typically responsible for – The runtime library (RTL) also implements OpenMP's user-level runtime routines – Each compiler has its own custom RTL
  • 29. How Is an OpenMP Program Compiled? Here's How OpenUH does it. A Guide to OpenMP Parse into Very High WHIRL High level, serial program optimizations Put OpenMP constructs into standard format Loop level serial program optimizations Transform OpenMP into threaded code Optimize, now, potentially, threaded code Output native instr. or interface with native compiler via source to source translation Liao, et. al.: http://www2.cs.uh.edu/~copper/openuh.pdf
  • 30. What Does the Transformed Code Look like? A Guide to OpenMP ● Intermediate code,“W2C” - WHIRL to C – uhcc -mp -gnu3 -CLIST:emit_nested_pu simple.c – http://www2.cs.uh.edu/~estrabd/OpenMP/simple/ static void __omprg_main_1(__ompv_gtid_a, __ompv_slink_a) _INT32 __ompv_gtid_a; _UINT64 __ompv_slink_a; { #include <stdio.h> register _INT32 _w2c___comma; int main(int argc, char *argv[]) { _UINT64 _temp___slink_sym0; int my_id; _INT32 __ompv_temp_gtid; #pragma omp parallel default(none) private(my_id) _INT32 __mplocal_my_id; { my_id = omp_get_thread_num(); /*Begin_of_nested_PU(s)*/ printf("hello from %dn",my_id); } _temp___slink_sym0 = __ompv_slink_a; return 0; __ompv_temp_gtid = __ompv_gtid_a; } _w2c___comma = omp_get_thread_num(); __mplocal_my_id = _w2c___comma; printf("hello from %dn", __mplocal_my_id); return; The original main() } /* __omprg_main_1 */ main is outlined to __omprg_main_1()
  • 31. The new main() A Guide to OpenMP extern _INT32 main() { register _INT32 _w2c___ompv_ok_to_fork; register _UINT64 _w2c_reg3; register _INT32 _w2c___comma; _INT32 my_id; _INT32 __ompv_gtid_s1; calls RTL fork and passes function pointer to outlined /*Begin_of_nested_PU(s)*/ main() _w2c___ompv_ok_to_fork = 1; if(_w2c___ompv_ok_to_fork) { __omprg_main_1's _w2c___ompv_ok_to_fork = __ompc_can_fork(); } frame pointer if(_w2c___ompv_ok_to_fork) { __ompc_fork(0, &__omprg_main_1, _w2c_reg3); } else { __ompv_gtid_s1 = __ompc_get_local_thread_num(); __ompc_serialized_parallel(); _w2c___comma = omp_get_thread_num(); my_id = _w2c___comma; serial version printf("hello from %dn", my_id); __ompc_end_serialized_parallel(); } return 0; } /* main */ Nobody wants to code like this, so we let the compiler and runtime do most of this tedious work!
  • 32. A Guide to OpenMP Programming with OpenMP 3.0
  • 33. The parallel Construct A Guide to OpenMP ● Where the “fork” occurs (e.g.,__ompc_fork(...)) ● Encloses all other OpenMP constructs & directives ● This construct accepts the following clauses: if, num_threads, private, firstprivate, shared, default, copyin, reduction ● Can call functions that contain “orphan” constructs – Statically (lexically) outside parallel region, but dynamically inside during runtime ● Can be nested
  • 34. A Simple OpenMP Example A Guide to OpenMP C/C++ get number of threads #include <stdio.h> #include <omp.h> int main (int argc, char *argv[]) { fork int tid, numt; numt = omp_get_num_threads(); #pragma omp parallel private(tid) shared(numt) get thread id { tid = omp_get_thread_num(); printf("hi, from %dn", tid); wait for all threads #pragma omp barrier if ( tid == 0 ) { printf("%d threads say hi!n",numt); } join (implicit barrier, all wait) } return 0; } Output using 4 threads: hi, from 3 hi, from 0 hi, from 2 hi, from 1 4 threads say hi! Note, thread order not guaranteed!
  • 35. The Fortran Version A Guide to OpenMP C/C++ F90 #include <stdio.h> program hello90 #include <omp.h> use omp_lib integer:: tid, numt int main (int argc, char *argv[]) { numt = omp_get_num_threads() int tid, numt; !$omp parallel private(id) shared(numt) numt = omp_get_num_threads(); tid = omp_get_thread_num() #pragma omp parallel private(tid) shared(numt) write (*,*) 'hi, from', tid { !$omp barrier tid = omp_get_thread_num(); if ( tid == 0 ) then printf("hi, from %dn", tid); write (*,*) numt,'threads say hi!' #pragma omp barrier end if if ( tid == 0 ) { !$omp end parallel printf("%d threads say hi!n",numt); end program } } return 0; } Output using 4 threads: hi, from 3 hi, from 0 hi, from 2 hi, from 1 4 threads say hi! Note, thread order not guaranteed!
  • 36. Now, Just the Parallelized Code A Guide to OpenMP C/C++ F90 #include <stdio.h> program hello90 #include <omp.h> use omp_lib integer:: tid, numt int main (int argc, char *argv[]) { numt = omp_get_num_threads() int tid, numt; !$omp parallel private(id) shared(numt) numt = omp_get_num_threads(); tid = omp_get_thread_num() #pragma omp parallel private(tid) shared(numt) write (*,*) 'hi, from', tid { !$omp barrier tid = omp_get_thread_num(); if ( tid == 0 ) then printf("hi, from %dn", tid); write (*,*) numt,'threads say hi!' #pragma omp barrier end if if ( tid == 0 ) { !$omp end parallel printf("%d threads say hi!n",numt); end program } } return 0; }} Output using 4 threads: hi, from 3 hi, from 0 hi, from 2 hi, from 1 4 threads say hi! Note, thread order not guaranteed!
  • 37. Trace of The Execution A Guide to OpenMP all threads call printf only thread with tid == 0 does this join 0 == 0 0 hi, from 0 B 0 fork 4 threads say hi! 1 hi, from 1 B 1 != 0 1 F J 2 hi, from 2 B 2 !=0 2 3 hi, from 3 B 3 != 0 3 other threads wait thread barrier B = wait for all threads @ barrier before progressing further.
  • 38. Controlling the Fork at Runtime A Guide to OpenMP ● The “if” clause contains a conditional expression. ● If TRUE, forking occurs, else it doesn't int n = some_func(); #pragma omp parallel if(n>5) { … do stuff in parallel } ● The “num_threads” clause is another way to control the number of threads active in a parallel contruct int n = some_func(); #pragma omp parallel num_threads(n) { … do stuff in parallel }
  • 39. The Data Environment Among Threads A Guide to OpenMP ● default([shared]|none|private) ● shared(list,) - supported by parallel construct only ● private(list,) ● firstprivate(list,) ● lastprivate(list,) - supported by loop & sections constructs only ● reduction(<op>:list,) ● copyprivate(list,) - supported by single construct only ● threadprivate – a standalone directive, not a clause #pragma omp threadprivate(list,) !$omp threadprivate(list,) ● copyin(list,) - supported by parallel construct only
  • 40. Private Variable Initialization A Guide to OpenMP ● private(list,) – Initialized value of variable(s) is undefned ● firstprivate(list,) – private variables are initialized with value of corresponding variable from master thread at time of fork ● copyin(list,) – Copy value of master's threadprivate variables to corresponding variables of the other threads ● threadprivate(list,) – Global lifetime objects are replicated so each thread has its own copy – static variables in C/C++, COMMON blocks in Fortran
  • 41. Getting Data Out A Guide to OpenMP ● copyprivate(list,) – Used by single thread to pass list values to corresponding private vars in the other, non-executing, threads ● lastprivate(list,) – vars in list will be given the last value assigned to it by a thread – supported by loop & sections construct ● reduction(<op>:list,) – aggregates vars in list using the defned operation – supported by parallel, loop, & sections constructs – <op> must be an actual operator or an intrinsic function
  • 42. private & shared in that Simple OpenMP Example A Guide to OpenMP C/C++ F90 #include <stdio.h> program hello90 #include <omp.h> use omp_lib integer:: tid, numt int main (int argc, char *argv[]) { numt = omp_get_num_threads() int tid, numt; !$omp parallel private(tid) shared(numt) numt = omp_get_num_threads(); tid = omp_get_thread_num() #pragma omp parallel private(tid) shared(numt) write (*,*) 'hi, from', tid { !$omp barrier tid = omp_get_thread_num(); if ( tid == 0 ) then printf("hi, from %dn", tid); write (*,*) numt,'threads say hi!' #pragma omp barrier end if if ( tid == 0 ) { !$omp end parallel printf("%d threads say hi!n",numt); end program } } return 0; } Output using 4 threads: hi, from 3 hi, from 0 hi, from 2 hi, from 1 4 threads say hi! Note, thread order not guaranteed!
  • 43. Memory Consistency Model A Guide to OpenMP ● OpenMP uses a “relaxed consistency” model ● In contrast especially to sequential consistency ● Cores may have out of date values in their cache ● Most constructs imply a “flush” of each thread's cache ● Treated as a memory “fence” by compilers when it comes to reordering operations ● OpenMP provides an explicit fush directive #pragma flush (list,) !$OMP FLUSH(list,)
  • 44. Synchronization A Guide to OpenMP ● Explict sync points are enabled with a barrier: #pragma omp barrier !$omp barrier ● Implicit sync points exist at the end of: – parallel, for, do, sections, single, WORKSHARE ● Implicit barriers can be turned of with “nowait” ● There is no barrier associated with: – critical, atomic, master ● Explicit barriers must be used when ordering is required and otherwise not guaranteed
  • 45. An explicit barrier in that Simple OpenMP Example A Guide to OpenMP C/C++ F90 #include <stdio.h> program hello90 #include <omp.h> use omp_lib integer:: tid, numt int main (int argc, char *argv[]) { numt = omp_get_num_threads() int tid, numt; !$omp parallel private(id) shared(numt) numt = omp_get_num_threads(); tid = omp_get_thread_num() #pragma omp parallel private(tid) shared(numt) write (*,*) 'hi, from', tid { !$omp barrier tid = omp_get_thread_num(); if ( tid == 0 ) then printf("hi, from %dn", tid); write (*,*) numt,'threads say hi!' #pragma omp barrier end if if ( tid == 0 ) { !$omp end parallel printf("%d threads say hi!n",numt); end program } } return 0; } Output using 4 threads: hi, from 3 hi, from 0 hi, from 2 hi, from 1 <barrier> 4 threads say hi!
  • 46. Trace of The Execution A Guide to OpenMP 0 == 0 0 hi, from 0 B 0 4 threads say hi! 1 hi, from 1 B 1 != 0 1 F J 2 hi, from 2 B 2 !=0 2 3 hi, from 3 B 3 != 0 3 #pragma omp barrier B = wait for all threads @ barrier before progressing further.
  • 47. The reduction Clause A Guide to OpenMP ● Supported by parallel and worksharing constructs – parallel, for, do, sections ● Creates a private copy of a shared var for each thread ● At the end of the construct containing the reduction clause, all private values are reduced into one using the specifed operator or intrinsic function #pragma omp parallel reduction(+:i) !$omp parallel reduction(+:i)
  • 48. A reduction Example A Guide to OpenMP C/C++ F90 #include <stdio.h> program hello90 #include <omp.h> use omp_lib integer:: t, i, numt int main (int argc, char *argv[]) { i = 0 int t, i; !$omp parallel private(t) reduction(+:i) i = 0; t = omp_get_thread_num() #pragma omp parallel private(t) reduction(+,i) i = t + 1; { write (*,*) 'hi, from', t t = omp_get_thread_num(); !$omp barrier i = t + 1; if ( t == 0 ) then printf("hi, from %dn", t); numt = omp_get_num_threads() #pragma omp barrier write (*,*) numt,'threads say hi!' if ( t == 0 ) { end if int numt = omp_get_num_threads(); !$omp end parallel printf("%d threads say hi!n",numt); write (*,*) 'i is reduced to ', i } end program } printf(“i is reduced to %dn”,i); return 0; } Output using 4 threads: hi, from 3 hi, from 0 hi, from 2 hi, from 1 4 threads say hi! i is reduced to 10
  • 49. Trace of a variable reduction A Guide to OpenMP a shared variable is initialized reduction occurs at end of the construct 0 i0 = 0 + 1 i = 0 1 i1 = 1 + 1 i = i+i0+i1+i2+i3 ... F i = 1+2+3+4 J ... 2 i2 = 2 + 1 i = 10 i = 10 3 i3 = 3 + 1 fnal value of i is available after end of construct each thread writes to its private variable
  • 50. Valid Operations for the reduction Clause A Guide to OpenMP ● Reduction operations in C/C++: – Arithmetic: + - * – Bitwise: &^| – Logical: && || ● Reduction operations in Fortran – Equivalent arithmetic, bitwise, and logical operations – min, max ● User defined reductions (UDR) is an area of current research ● Note: initialized value matters!
  • 51. Nested parallel Constructs A Guide to OpenMP ● Specifcation makes this feature optional – OMP_NESTED={true,false} – OMP_MAX_ACTIVE_LEVELS={1,2,..} – omp_{get,set}_nested() – omp_get_level() – omp_get_ancestor_thread_num(level) ● Each encountering thread becomes the master of the newly forked team ● Each subteam is numbered 0 through N-1 ● Useful, but still incurs parallel overheads
  • 52. The Uniqueness of Thread Numbers in Nesting A Guide to OpenMP Thread numbers are not unique; paths to each thread are. Nest Level 0 0 __ompc_fork(..) Nest Level 1 __ompc_fork(..) __ompc_fork(..) __ompc_fork(..) 2 1 0 Nest Level 2 2 1 0 2 1 0 2 1 0 <0,2,2> <0,0,2> Because of the tree structure, each thread can be uniquely identifed by its full path from the root of its sub-tree; This path tuple can be calculated in O(level) using omp_get_level and omp_get_ancestor_thread_num in combination.
  • 53. Work Sharing Constructs A Guide to OpenMP ● Threads “share” work, access data in shared memory ● OpenMP provides “work sharing” constructs ● These specify the work to be distributed among the threads and state how the work is assigned to them ● Data race conditions are illegal ● They include: – loops (for, DO) – sections – WORKSHARE (Fortran only) – single, master
  • 54. The Loop Constructs: for & DO A Guide to OpenMP ● The loop constructs distribute iterations among threads according to some schedule (default is static) ● Among frst constructs used when introducing OpenMP ● The clauses supported by the loop constructs are: private, firstprivate, lastprivate, reduction, schedule, order, collapse, nowait ● The loop's schedule refers to the runtime policy used to distribute work among the threads.
  • 55. OpenMP Parallelizes Loops by Distributing Iterations to Each Thread A Guide to OpenMP int i; #pragma omp for for (i=0;i <= 99; i++) { // do stuff } for (i=0;i <= 33; i++) { for (i=68;i <= 99; i++) { // do stuff // do stuff } } thread 0 for (i=34;i <= 67; i++) { thread 2 i = 0 thru 33 // do stuff i = 68 thru 99 } thread 1 i = 34 thru 67
  • 56. Parallelizing Loops - C/C++ A Guide to OpenMP #include <stdio.h> #include <omp.h> #define N 100 int main(void) { float a[N], b[N], c[N]; int i; omp_set_dynamic(0); // ensures use of all available threads omp_set_num_threads(20); // sets number of all available threads to 20 /* Initialize arrays a and b. */ for (i = 0; i < N; i++) { a[i] = i * 1.0; b[i] = i * 2.0; } /* Compute values of array c in parallel. */ #pragma omp parallel shared(a, b, c) private(i) { #pragma omp for [nowait] for (i = 0; i < N; i++) c[i] = a[i] + b[i]; } printf ("%fn", c[10]); } http://developers.sun.com/solaris/articles/studio_openmp.html
  • 57. Parallelizing Loops - C/C++ A Guide to OpenMP #include <stdio.h> #include <omp.h> #define N 100 int main(void) { float a[N], b[N], c[N]; int i; omp_set_dynamic(0); // ensures use of all available threads omp_set_num_threads(20); // sets number of all available threads to 20 /* Initialize arrays a and b. */ for (i = 0; i < N; i++) { a[i] = i * 1.0; b[i] = i * 2.0; } /* Compute values of array c in parallel. */ #pragma omp parallel shared(a, b, c) private(i) { #pragma omp for [nowait] “nowait” is optional for (i = 0; i < N; i++) c[i] = a[i] + b[i]; } printf ("%fn", c[10]); } http://developers.sun.com/solaris/articles/studio_openmp.html
  • 58. Parallelizing Loops - Fortran A Guide to OpenMP PROGRAM VECTOR_ADD USE OMP_LIB PARAMETER (N=100) INTEGER N, I REAL A(N), B(N), C(N) CALL MP_SET_DYNAMIC (.FALSE.) !ensures use of all available threads CALL OMP_SET_NUM_THREADS (20) !sets number of available threads to 20 ! Initialize arrays A and B. DO I = 1, N A(I) = I * 1.0 B(I) = I * 2.0 ENDDO ! Compute values of array C in parallel. !$OMP PARALLEL SHARED(A, B, C), PRIVATE(I) !$OMP DO DO I = 1, N C(I) = A(I) + B(I) ENDDO !$OMP END DO [nowait] ! ... some more instructions !$OMP END PARALLEL PRINT *, C(10) END http://developers.sun.com/solaris/articles/studio_openmp.html
  • 59. Parallelizing Loops - Fortran A Guide to OpenMP PROGRAM VECTOR_ADD USE OMP_LIB PARAMETER (N=100) INTEGER N, I REAL A(N), B(N), C(N) CALL MP_SET_DYNAMIC (.FALSE.) !ensures use of all available threads CALL OMP_SET_NUM_THREADS (20) !sets number of available threads to 20 ! Initialize arrays A and B. DO I = 1, N A(I) = I * 1.0 B(I) = I * 2.0 ENDDO ! Compute values of array C in parallel. !$OMP PARALLEL SHARED(A, B, C), PRIVATE(I) !$OMP DO DO I = 1, N C(I) = A(I) + B(I) “nowait” is optional & added ENDDO !$OMP END DO [nowait] to the closing directive ! ... some more instructions !$OMP END PARALLEL PRINT *, C(10) END http://developers.sun.com/solaris/articles/studio_openmp.html
  • 60. Parallel Loop Scheduling A Guide to OpenMP ● The scheduling is a description that states which loop iterations are assigned to a particular thread ● There are 5 types: – static – each thread calculates its own chunk – dynamic – “frst come, frst served”, managed by runtime – guided – dynamic, with decreasing chunk sizes – auto – determined automatically by compiler or runtime – runtime – defned by OMP_SCHEDULE or omp_set_schedule ● Limitations – only one schedule type may be used for a given loop – the chunk size applies to all threads
  • 61. Parallel Loop Scheduling - Example A Guide to OpenMP Fortran !$OMP PARALLEL SHARED(A, B, C) PRIVATE(I) !$OMP DO SCHEDULE (DYNAMIC,4) DO I = 1, N C(I) = A(I) + B(I) ENDDO !$OMP END DO [nowait] !$OMP END PARALLEL schedule chunk size C/C++ #pragma omp parallel shared(a, b, c) private(i) { #pragma omp for schedule (guided,4) [nowait] for (i = 0; i < N; i++) c[i] = a[i] + b[i]; }
  • 62. The ordered Clause and ordered construct A Guide to OpenMP ● An ordered loop contains code that must execute in serial order ● No clauses are supported ● The ordered code must be inside an ordered construct #pragma omp parallel shared(a, b, c) private(i) ordered clause { #pragma omp for ordered for (i = 0; i <= 99; i++) { // do a lot of stuff concurrently #pragma omp ordered ordered construct { a = i * (b + c); b = i * (a + c); c = i * (a + b); } } }
  • 63. The collapse Clause A Guide to OpenMP ● n levels are collapsed into a combined iteration space ● The schedule and ordered constructs are applied as expected ● E.g., #pragma omp parallel shared(a, b, c) private(i) { #pragma omp for schedule(dynamic,4) collapse(2) for (i = 0; i <= 2; i++) { for (j = 0; j <= 2; j++) { // do stuff for each i,j } } ● Iterations are distributed as (i*j) n-tuples, e.g., <i,j>: – <0,0>; <0,1>; <0,2>; <1,0>; <1,1>; <1,2>; <2,0>; <2,1>; <2,2>;
  • 64. The WORKSHARE Construct (Fortran Only) A Guide to OpenMP ● Provides for parallel execution of code using F90 array syntax ● The clauses supported by the WORKSHARE construct are: private, firstprivate, copyprivate, nowait ● There is an implicit barrier at the end of this construct ● Valid Fortran code enclosed in a workshare construct: – Array & scalar variable assignments – FORALL statements & constructs – WHERE statements & constructs – User defned functions of type ELEMENTAL – OpenMP atomic, critical, & parallel
  • 65. The sections Construct A Guide to OpenMP ● The sections construct defnes code that is to be executed once by exactly one thread ● A barrier is implied ● Supported clauses include: private, firstprivate, lastprivate, reduction, nowait
  • 66. A section Construct Example A Guide to OpenMP #include <stdio.h> #include <omp.h> int square(int n){ return n*n; } int main(void){ int x, y, z, xs, ys, zs; omp_set_dynamic(0); omp_set_num_threads(3); x = 2; y = 3; z = 5; #pragma omp parallel shared(xs,ys,zs) { #pragma omp sections { #pragma omp section { xs = square(x); printf ("id = %d, xs = %dn", omp_get_thread_num(), xs); } #pragma omp section { ys = square(y); printf ("id = %d, ys = %dn", omp_get_thread_num(), ys); } #pragma omp section { zs = square(z); printf ("id = %d, zs = %dn", omp_get_thread_num(), zs); } } } return 0; } http://developers.sun.com/solaris/articles/studio_openmp.h tml
  • 67. A section Construct Example A Guide to OpenMP #pragma omp sections { #pragma omp section { xs = square(x); printf ("id = %d, xs = %dn", omp_get_thread_num(), xs); } #pragma omp section { ys = square(y); printf ("id = %d, ys = %dn", omp_get_thread_num(), ys); } #pragma omp section { zs = square(z); printf ("id = %d, zs = %dn", omp_get_thread_num(), zs); } } thread 0 thread 1 thread 2 t=0 t=1 Time
• 68. Combined parallel Constructs
● parallel may be combined with the following:
– for, do, sections, WORKSHARE
● Semantics are identical to the usage already discussed

!$OMP PARALLEL DO SHARED(A, B, C) PRIVATE(I)
!$OMP& SCHEDULE(DYNAMIC,4)
      DO I = 1, N
         C(I) = A(I) + B(I)
      ENDDO
!$OMP END PARALLEL DO

#pragma omp parallel for shared(a, b, c) private(i) schedule(guided,4)
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
• 69. Singling Out Threads with master and single Constructs
● Code inside a master construct will only be executed by the master thread.
● There is NO implicit barrier associated with master; other threads ignore it.
!$OMP MASTER
   ... do stuff
!$OMP END MASTER
● Code inside a single construct will be executed by the first thread to encounter it.
● A single construct ends with an implicit barrier, which can be removed with nowait.

C/C++:
#pragma omp single [nowait] [copyprivate(list)]
{
   ... do stuff
}

Fortran (note the position of the clauses):
!$OMP SINGLE
   ... do stuff
!$OMP END SINGLE [nowait] [copyprivate(list)]
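A minimal sketch contrasting the two constructs (an assumed example, not from the original slides); the printed messages are illustrative.

#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
    {
        /* Only the master thread (id 0) runs this; no barrier,
           so the other threads continue past it immediately. */
#pragma omp master
        printf("master: run by thread %d\n", omp_get_thread_num());

        /* The first thread to arrive runs this; the others wait
           at the implicit barrier at the end of the construct. */
#pragma omp single
        printf("single: run by thread %d\n", omp_get_thread_num());
    }
    return 0;
}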
• 70. The task Construct
● Tasks were added in OpenMP 3.0 to handle dynamic and unstructured applications:
– Recursion
– Tree & graph traversals
● OpenMP's thread-based execution model was redefined: a thread is now considered to execute an implicit task
● The task construct defines singular, explicit tasks
● Less overhead than nested parallel regions
• 71. Threads are now Implicit Tasks
[Diagram: process 0 spans main memory and four CPUs; the thread on each CPU is relabeled as implicit task 0–3.]
• 72. Each Thread Conceptually Has Both a tied & untied Queue
[Diagram: each implicit task 0–3, one per CPU, holds a tied (private) queue and an untied (public) queue; all share main memory.]
• 73. The task Construct
● Clauses supported are: if, default, private, firstprivate, shared, tied/untied
● By default, all variables are firstprivate
● Tasks can be nested syntactically, but still execute asynchronously
● The taskwait directive causes a task to wait until all of its children have completed
• 74. A task Construct C/C++ Example
struct node {
    struct node *left;
    struct node *right;
};

extern void process(struct node *);

void traverse( struct node *p )
{
    if (p->left)
#pragma omp task        // p is firstprivate by default
        traverse(p->left);
    if (p->right)
#pragma omp task        // p is firstprivate by default
        traverse(p->right);
    process(p);
}
• 75. A task Construct Fortran Example
RECURSIVE SUBROUTINE traverse ( P )
   TYPE Node
      TYPE(Node), POINTER :: left, right
   END TYPE Node
   TYPE(Node) :: P
   IF (associated(P%left)) THEN
!$OMP TASK   ! P is firstprivate by default
      call traverse(P%left)
!$OMP END TASK
   ENDIF
   IF (associated(P%right)) THEN
!$OMP TASK   ! P is firstprivate by default
      call traverse(P%right)
!$OMP END TASK
   ENDIF
   CALL process ( P )
END SUBROUTINE
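As a sketch of the taskwait directive (an assumed example, not from the original slides), a recursive Fibonacci is the usual illustration: each call spawns two child tasks and waits for both before combining their results. This cutoff-free version is deliberately minimal and would be inefficient for large n.

#include <stdio.h>
#include <omp.h>

int fib(int n)
{
    int x, y;
    if (n < 2) return n;

#pragma omp task shared(x)   /* child task computes x */
    x = fib(n - 1);

#pragma omp task shared(y)   /* child task computes y */
    y = fib(n - 2);

    /* Wait for both child tasks before using x and y. */
#pragma omp taskwait
    return x + y;
}

int main(void)
{
    int result;
#pragma omp parallel
    {
#pragma omp single           /* one thread seeds the task tree */
        result = fib(10);
    }
    printf("fib(10) = %d\n", result);
    return 0;
}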
• 76. Mutual Exclusion: critical, atomic, and omp_lock_t
● Some code must be executed by one thread at a time
● This effectively serializes the threads' execution of that code
● Such code regions are often called critical sections
● OpenMP provides 3 ways to achieve mutual exclusion:
– The critical construct encloses a critical section
– The atomic construct encloses updates to shared variables
– A low-level, general-purpose locking mechanism
• 77. The critical Construct
● The critical construct encloses code that all threads execute, just one at a time, in some serial order
#pragma omp parallel
{
#pragma omp critical
    {
        // some code
    }
}
● The effect is equivalent to a lock protecting the code
• 78. A critical Construct Example
#pragma omp parallel shared(a, b, c) private(i)
{
#pragma omp critical
    {
        // do stuff (one thread at a time)
    }
}
[Diagram: a timeline from t=0 to t=2 showing threads 1, 0, and 2 entering the critical section one after another.]
Note: the order in which threads encounter the critical section is not guaranteed!
• 79. The Named critical Construct
● Names may be applied to critical constructs:
#pragma omp parallel
{
#pragma omp critical(a)
    {
        // some code
    }
#pragma omp critical(b)
    {
        // some code
    }
#pragma omp critical(c)
    {
        // some code
    }
}
● The effect is equivalent to using a different lock for each named section
• 80. A Named critical Construct Example
#include <stdio.h>
#include <omp.h>
#define N 100

int main(void)
{
    float a[N], b[N], c[3];
    int i;
    /* Initialize arrays a and b. */
    for (i = 0; i < N; i++) {
        a[i] = i * 1.0 + 1.0;
        b[i] = i * 2.0 + 2.0;
    }
    c[0] = c[1] = c[2] = 0.0;   /* initialize the accumulators */
    /* Compute values of array c in parallel. */
#pragma omp parallel shared(a, b, c) private(i)
    {
#pragma omp critical(a)
        {
            for (i = 0; i < N; i++)
                c[0] += a[i] + b[i];
            printf("%f\n", c[0]);
        }
#pragma omp critical(b)
        {
            for (i = 0; i < N; i++)
                c[1] += a[i] + b[i];
            printf("%f\n", c[1]);
        }
#pragma omp critical(c)
        {
            for (i = 0; i < N; i++)
                c[2] += a[i] + b[i];
            printf("%f\n", c[2]);
        }
    }
}
• 81. A Named critical Construct Example
#pragma omp critical(a)
{
    // some code
}
#pragma omp critical(b)
{
    // some code
}
#pragma omp critical(c)
{
    // some code
}
[Diagram: a timeline showing threads 0, 1, and 2 rotating through the three named critical sections in successive time steps.]
Note: the order in which threads encounter the sections is not guaranteed!
• 82. The atomic Construct for Safely Updating Shared Variables
● Protects writes to shared variables
● Typically lighter weight than a critical construct

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int count = 0;
#pragma omp parallel shared(count)
    {
#pragma omp atomic
        count++;
    }
    printf("Number of threads: %d\n", count);
}
Note: the order in which threads perform the update is not guaranteed!
• 83. Locks in OpenMP
● omp_lock_t, omp_lock_kind
● omp_nest_lock_t, omp_nest_lock_kind
● Threads set/unset locks
● Nested locks can be set multiple times by the same thread before being released
● More flexible than the critical and atomic constructs
– the behavior of both is defined in terms of locks
– e.g., "named" critical sections behave as if each name used a different lock variable
● Locking routines imply a memory flush
• 84. Single Locks vs Nested Locks
● A single lock may be set once and only once before it is unset
● A single lock must be unset by the owning thread, and only once
● A nested lock may be set multiple times, but only by the thread that initially set it
● A nested lock must be unset once for each time it was set
● A nested lock is not released until its lock count has been decremented to 0 (see the sketch below)
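A minimal sketch of the nesting semantics (an assumed example, not from the original slides): the same thread sets the nested lock twice and must unset it twice before the lock is released.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_nest_lock_t lck;
    omp_init_nest_lock(&lck);

    omp_set_nest_lock(&lck);    /* count = 1 */
    omp_set_nest_lock(&lck);    /* same thread, so this succeeds; count = 2 */

    printf("lock held twice by the same thread\n");

    omp_unset_nest_lock(&lck);  /* count = 1: still held */
    omp_unset_nest_lock(&lck);  /* count = 0: now released */

    omp_destroy_nest_lock(&lck);
    return 0;
}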
• 85. Using Locks in OpenMP
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

/* This example mimics the behavior of the atomic construct,
   using a lock to safely update a shared variable. */
int main()
{
    int x;
    omp_lock_t lck;

    /* master thread initializes the lock variable before the fork */
    omp_init_lock(&lck);
    x = 0;
#pragma omp parallel shared(x, lck)
    {
        omp_set_lock(&lck);    /* threads wait to acquire the lock */
        x = x + 1;             /* the thread holding the lock updates x */
        omp_unset_lock(&lck);  /* the thread releases the lock */
    }
    /* master thread destroys the lock after the join */
    omp_destroy_lock(&lck);
}
• 86. Using Nested Locks in OpenMP
#include <omp.h>

typedef struct {
    int a, b;
    omp_nest_lock_t lck;
} pair;

int work1();
int work2();
int work3();

void incr_a(pair *p, int a)
{
    /* Called only from incr_pair, no need to lock. */
    p->a += a;
}

void incr_b(pair *p, int b)
{
    /* Called both from incr_pair and elsewhere, */
    /* so a nestable lock is needed. */
    omp_set_nest_lock(&p->lck);
    p->b += b;
    omp_unset_nest_lock(&p->lck);
}

void incr_pair(pair *p, int a, int b)
{
    omp_set_nest_lock(&p->lck);
    incr_a(p, a);
    incr_b(p, b);
    omp_unset_nest_lock(&p->lck);
}

void a45(pair *p)
{
#pragma omp parallel sections
    {
#pragma omp section
        incr_pair(p, work1(), work2());
#pragma omp section
        incr_b(p, work3());
    }
}
• 87. Fortran Programming Tips
● Directives are case insensitive
● In fixed-form Fortran, OpenMP directives can hide behind the following "sentinels": !$[OMP], c$[OMP], *$[OMP]
● Free form requires "!$"
● Sentinels can enable conditional compilation:
!$ call omp_set_num_threads(n)
● Fixed-form directives must start in column 1
● Long directive continuations take a form similar to:
!$OMP PARALLEL DEFAULT(NONE)
!$OMP& SHARED(INP,OUTP,BOXL,TEMP,RHO,NSTEP,TSTEP,X,Y,Z,VX,VY,VZ,BOXL)
!$OMP& SHARED(XO,YO,ZO,TSTEP,V2T,VXT,VYT,VZT,IPRINT,ISTEP,ETOT,ERUN)
!$OMP& SHARED(FX,FY,FZ,PENER)
!$OMP& PRIVATE(I)
• 88. C/C++ Programming Tips
● A directive occupies a single line; only the preprocessor's backslash-newline continuation is available
● Directives are case sensitive
● No conditional-compilation sentinels; use "#ifdef _OPENMP", etc.
● Code style – all #pragmas in column 0, braces indented as usual:
int main (...)
{
    ...
#pragma omp parallel
    {
#pragma omp sections
        {
#pragma omp section
            {
                xs = square(x);
                printf("id = %d, xs = %d\n", omp_get_thread_num(), xs);
            }
#pragma omp section
            {
                ys = square(y);
                printf("id = %d, ys = %d\n", omp_get_thread_num(), ys);
            }
        }
    }
    return 0; /* end main */
}
• 89. General Programming Tips
● Minimize parallel constructs
● Use combined constructs, if doing so doesn't violate the above (see the sketch below)
● Minimize shared variables, maximize private
● Minimize barriers, but don't sacrifice safety
● When inserting OpenMP into existing code:
– Use a disciplined, iterative cycle – test against the serial version
– Use barriers liberally
– Optimize OpenMP & asynchronize last
● When starting from scratch:
– Start with an optimized serial version
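As an illustration of the first two tips (an assumed example, not from the original slides): fusing several work-sharing loops into one parallel region avoids repeated fork/join overhead. The function name and array sizes are illustrative.

#include <omp.h>

#define N 1000   /* illustrative problem size */

/* Two work-sharing loops inside a single parallel region:
   threads are forked once, not once per loop. */
void scale_and_sum(double *a, double *b, double *c)
{
    int i;
#pragma omp parallel shared(a, b, c) private(i)
    {
#pragma omp for
        for (i = 0; i < N; i++)
            b[i] = 2.0 * a[i];

#pragma omp for
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }   /* one fork/join instead of two */
}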
• 90. The Future of OpenMP
● Vendor buy-in and R&D support is as strong as ever
● Must remain relevant
● Active areas of research:
– Refinement of the tasking model (scheduling, etc.)
– User-defined reductions (UDRs)
– Accelerators & heterogeneous environments
– Error handling
– Hybrid models
● Scaling issues being addressed:
– Thousands of threads
– Data locality
– More flexible & efficient synchronization
• 91. A Practical Example – Calculating π
Mathematically, we know that ∫₀¹ 4/(1+x²) dx = π, and this integral can be approximated as a sum of the areas of rectangles, where each rectangle has a width of Δx and a height of F(xᵢ) at the middle of interval i (see the worked formula below).

static long num_steps = 100000;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
}
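Written out in full, the midpoint-rule approximation that the loop implements is:

\pi = \int_0^1 \frac{4}{1+x^2}\,dx
    \approx \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}\,\Delta x,
\qquad x_i = \left(i + \tfrac{1}{2}\right)\Delta x,
\qquad \Delta x = \frac{1}{N}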
• 92. A Naïve Parallel Program - Calculating π
● Promote the scalar sum to an array dimensioned by the number of threads to avoid a race condition
● Can't assume the number of threads requested was granted; use single to prevent conflicts
● False sharing of the sum array will affect performance

#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2

void main ()
{
    int i, id, nthreads;
    double x, pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
#pragma omp parallel private (i, id, x)
    {
        id = omp_get_thread_num();
#pragma omp single
        nthreads = omp_get_num_threads();
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthreads) {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++)
        pi += sum[i] * step;
}
• 93. A Slightly More Efficient Parallel Program - Calculating π
● No array, so no false sharing
● Note: this method of combining partial sums doesn't scale very well

#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2

void main ()
{
    int i, id, nthreads;
    double x, pi = 0.0, sum;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
#pragma omp parallel private (i, id, x, sum)
    {
        id = omp_get_thread_num();
#pragma omp single
        nthreads = omp_get_num_threads();
        for (i = id, sum = 0.0; i < num_steps; i = i + nthreads) {
            x = (i+0.5)*step;
            sum += 4.0/(1.0+x*x);
        }
#pragma omp critical
        pi += sum * step;
    }
}
• 94. A Simple & Efficient Parallel Program - Calculating π
● For good OpenMP implementations, reduction is more scalable than a critical construct
● i is private by default
● Note: the parallel program is created without changing any existing code, just by adding 4 simple lines

#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2

void main ()
{
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
}
• 95. The Final Parallel Program - Calculating π
● Same code as the previous slide
● In practice, the number of threads is set using the environment variable OMP_NUM_THREADS rather than a hard-coded omp_set_num_threads call
• 96. Additional Resources
● http://www.cs.uh.edu/~hpctools
● http://www.compunity.org
● http://www.openmp.org – Specification 3.0
● "Using OpenMP", Chapman et al. – covers through 2.5
  • 97. Tutorial Contributors A Guide to OpenMP ● Brett Estrade ● Dr. Barbara Chapman ● Amrita Banerjee The latest version of this presentation may be downloaded at: http://www.cs.uh.edu/~hpctools/OpenMP All feedback is welcome and appreciated. Please send all feedback to Brett: estrabd@cs.uh.edu