OpenMP
Speaker: 呂宗螢
Date: 2007/06/01
Embedded and Parallel Systems Lab2
Outline
OpenMP
 OpenMP 2.5
 Multi-threaded & shared memory
 Fortran, C / C++
 Basic syntax
 #pragma omp directive [clause]
 Required and supported environments
 Windows
 Visual Studio 2005 Standard
 Intel® C++ Compiler 9.1
 Linux
 gcc 4.2.0
 Omni
 Xbox 360 & PS3
Windows
 Add #include <omp.h> at the top of the program
 Visual Studio 2005 Standard
Project / Project Properties / Configuration Properties / C/C++ / Language
Set "OpenMP Support" to Yes
Linux
 gcc 4.2
 If it is not installed, download gcc from GNU:
http://gcc.gnu.org/
Using gcc 4.2.1 as an example:
1. Extract gcc
tar -zxvf gcc-4.2.1.tar.gz
2. Enter the directory
cd gcc-4.2.1
3. Configure, installing to /opt/gcc-4.2.1
./configure --prefix=/opt/gcc-4.2.1/
4. Compile
make
5. Install
make install
OpenMP Constructs
Types of Work-Sharing Constructs
 Loop: shares iterations of a loop across the team. Represents a type of "data parallelism".
 Sections: breaks work into separate, discrete sections. Each section is executed by a thread. Can be used to implement a type of "functional parallelism".
Source: http://www.llnl.gov/computing/tutorials/openMP/
Types of Work-Sharing Constructs
 single: runs the enclosed block on exactly one thread of the team (not necessarily the master thread)
Source : http://www.llnl.gov/computing/tutorials/openMP/
Loop work sharing
#pragma omp parallel for
for( int i = 0 ; i < 10000 ; i++ )
    for( int j = 0 ; j < 100 ; j++ )
        function(i);

is equivalent to

#pragma omp parallel
{ // the opening brace must start a new line; it cannot follow the directive
#pragma omp for
for( int i = 0 ; i < 10000 ; i++ )
    for( int j = 0 ; j < 100 ; j++ )
        function(i);
}

parallel for requires the loop index to be an int, and the iteration count must be known before the loop starts.

On a dual-core CPU the loop is split like this:
Thread 0 (master)
for( i = 0 ; i < 5000 ; i++ )
    for( int j = 0 ; j < 100 ; j++ )
        function(i);
Thread 1
for( i = 5000 ; i < 10000 ; i++ )
    for( int j = 0 ; j < 100 ; j++ )
        function(i);
OpenMP example: log.cpp
#include <omp.h>
#pragma omp parallel for num_threads(2) private(x, z, addr, ans) // split the outer loop evenly across 2 threads; x, z, addr, and ans must be private or the threads race on them
for (y=2;y<BufSizeY-2;y++)
for (x=2;x<BufSizeX-2;x++)
for (z=0;z<BufSizeBand;z++) {
addr=(y*BufSizeX+x)*BufSizeBand+z;
ans = (BYTE)(*(InBuf+addr))*16+
(BYTE)(*(InBuf+((y*BufSizeX+x+1)*BufSizeBand+z)))*(-2) +
(BYTE)(*(InBuf+((y*BufSizeX+x-1)*BufSizeBand+z)))*(-2) +
(BYTE)(*(InBuf+(((y+1)*BufSizeX+x)*BufSizeBand+z)))*(-2)+
(BYTE)(*(InBuf+(((y-1)*BufSizeX+x)*BufSizeBand+z)))*(-2)+
(BYTE)(*(InBuf+((y*BufSizeX+x+2)*BufSizeBand+z)))*(-1)+
(BYTE)(*(InBuf+((y*BufSizeX+x-2)*BufSizeBand+z)))*(-1)+
(BYTE)(*(InBuf+(((y+2)*BufSizeX+x)*BufSizeBand+z)))*(-1)+
(BYTE)(*(InBuf+(((y-2)*BufSizeX+x)*BufSizeBand+z)))*(-1)+
(BYTE)(*(InBuf+(((y+1)*BufSizeX+x+1)*BufSizeBand+z)))*(-1) +
(BYTE)(*(InBuf+(((y+1)*BufSizeX+x-1)*BufSizeBand+z)))*(-1)+
(BYTE)(*(InBuf+(((y-1)*BufSizeX+x+1)*BufSizeBand+z)))*(-1)+
(BYTE)(*(InBuf+(((y-1)*BufSizeX+x-1)*BufSizeBand+z)))*(-1);
*(OutBuf+addr)=abs(ans)/8;
}
[Figure: source image and the output image produced by the Log filter]
Sections work sharing
int main(int argc, char* argv[]) {
#pragma omp parallel sections
{
#pragma omp section
{
toPNG();
}
#pragma omp section
{
toJPG();
}
#pragma omp section
{
toTIF();
}
}
}
[Figure: one input image converted concurrently by the toPNG, toJPG, and toTIF sections]
OpenMP notice
int Fe[10];
Fe[0] = 0;
Fe[1] = 1;
#pragma omp parallel for num_threads(2)
for( i = 2; i < 10; ++ i )
Fe[i] = Fe[i-1] + Fe[i-2];
Data dependence: iteration i reads Fe[i-1] and Fe[i-2], which another thread may not have written yet, so the parallel result is wrong.
#pragma omp parallel
{
#pragma omp for
for( int i = 0; i < 1000000; ++ i )
sum += i;
}
Race condition: multiple threads update the shared variable sum at the same time.
OpenMP notice
 Deadlock
int me;
#pragma omp parallel private(me)
{
    me = omp_get_thread_num();
    if (me == 0) goto Master;   // thread 0 skips the barrier...
    #pragma omp barrier         // ...so the other threads wait here forever
Master:
    #pragma omp single
    printf("done\n");
}
OpenMP example: matrix(1)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define RANDOM_SEED 2882 // random seed
#define VECTOR_SIZE 4 // square matrix: width equals height
#define MATRIX_SIZE (VECTOR_SIZE * VECTOR_SIZE) // total number of matrix elements
int main(int argc, char *argv[]){
int i,j,k;
int node_id;
int *AA; // sequential use & for checking whether d2mce is right or faulty
int *BB; // sequential use
int *CC; // sequential use
int computing;
int _vector_size = VECTOR_SIZE;
int _matrix_size = MATRIX_SIZE;
char c[10];
OpenMP example: matrix(2)
if(argc > 1){
for( i = 1 ; i < argc ; ){
if(strcmp(argv[i],"-s") == 0){
_vector_size = atoi(argv[i+1]);
_matrix_size = _vector_size * _vector_size;
i += 2;
}
else{
printf("the only valid argument is:\n");
printf("-s: the vector size, e.g. -s 256\n");
return 0;
}
}
}
AA = (int *)malloc(sizeof(int) * _matrix_size);
BB = (int *)malloc(sizeof(int) * _matrix_size);
CC = (int *)malloc(sizeof(int) * _matrix_size);
OpenMP example: matrix(3)
srand( RANDOM_SEED );
/* create matrix A and Matrix B */
for( i=0 ; i< _matrix_size ; i++){
AA[i] = rand()%10;
BB[i] = rand()%10;
}
/* computing C = A * B */
#pragma omp parallel for private(computing, j , k)
for( i=0 ; i < _vector_size ; i++){
for( j=0 ; j < _vector_size ; j++){
computing =0;
for( k=0 ; k < _vector_size ; k++)
computing += AA[ i*_vector_size + k ] *
BB[ k*_vector_size + j ];
CC[ i*_vector_size + j ] = computing;
}
}
OpenMP example: matrix(4)
printf("\nVector_size:%d\n", _vector_size);
printf("Matrix_size:%d\n", _matrix_size);
printf("Processing time:%f\n", time);
return 0;
}
OpenMP Directive Table
Directive Description
atomic Specifies a memory location that will be updated atomically.
barrier Synchronizes all threads in a team; all threads pause at the barrier until every thread has executed it.
critical Specifies that code is executed by only one thread at a time.
flush Specifies that all threads have the same view of memory for all shared objects.
for Causes the work done in a for loop inside a parallel region to be divided among threads.
master Specifies that only the master thread should execute a section of the program.
ordered Specifies that code under a parallelized for loop should be executed like a sequential loop.
parallel Defines a parallel region, which is code that will be executed by multiple threads in parallel.
sections Identifies code sections to be divided among all threads.
single Lets you specify that a section of code should be executed on a single thread, not necessarily the master thread.
threadprivate Specifies that a variable is private to a thread.
Source :http://msdn2.microsoft.com/zh-tw/library/0ca2w8dk(VS.80).aspx
OpenMP Clause Table
Clause Description
copyin Allows threads to access the master thread's value of a threadprivate variable.
copyprivate Specifies that one or more variables should be shared among all threads.
default Specifies the behavior of unscoped variables in a parallel region.
firstprivate Specifies that each thread should have its own instance of a variable, initialized with the variable's value as it exists before the parallel construct.
if Specifies whether a loop should be executed in parallel or in serial.
lastprivate Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).
nowait Overrides the barrier implicit in a directive.
num_threads Sets the number of threads in a thread team.
ordered Required on a parallel for statement if an ordered directive is to be used in the loop.
private Specifies that each thread should have its own instance of a variable.
reduction Specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region.
schedule Applies to the for directive. Has four methods: static, dynamic, guided, runtime.
shared Specifies that one or more variables should be shared among all threads.
Source :http://msdn2.microsoft.com/zh-tw/library/0ca2w8dk(VS.80).aspx
Reference
 Michael J. Quinn, "Parallel Programming in C with MPI and OpenMP"
 Introduction to Parallel Computing: http://www.llnl.gov/computing/tutorials/parallel_comp/
 OpenMP standard: http://www.openmp.org/drupal/
 OpenMP MSDN tutorial: http://msdn2.microsoft.com/en-us/library/tt15eb9t(VS.80).aspx
 OpenMP tutorial: http://www.llnl.gov/computing/tutorials/openMP/#DO
 Kang Su Gatlin, Pete Isensee, "Reap the Benefits of Multithreading without All the Work", MSDN Magazine