์Šˆํผ์ปดํ“จํŒ… ๊ต์œก - UNIST

Parallel Programming

CONTENTS
I. Introduction to Parallel Computing
II. Parallel Programming using OpenMP
III. Parallel Programming using MPI

I. Introduction to Parallel Computing
Parallel Processing (1/3)

Parallel processing divides a computation that would otherwise run sequentially into several parts and executes those parts simultaneously on multiple processors.
Parallel Processing (2/3)

[Figure: the same inputs produce the same outputs whether the work is executed sequentially on one processor or in parallel on several.]
Parallel Processing (3/3)

Main goal: solve larger problems faster
- Reduce the wall-clock time of a program
- Increase the size of the problems that can be solved

Computational resources for parallel computing
- A single computer with multiple processors (CPUs)
- Multiple computers connected by a network
์™œ ๋ณ‘๋ ฌ์ธ๊ฐ€?
๊ณ ์„ฑ๋Šฅ ๋‹จ์ผ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ๊ฐœ๋ฐœ์˜ ์ œํ•œ
๏‚ง

์ „์†ก์†๋„์˜ ํ•œ๊ณ„ (๊ตฌ๋ฆฌ์„  : 9 cm/nanosec)

๏‚ง

์†Œํ˜•ํ™”์˜ ํ•œ๊ณ„

๏‚ง

๊ฒฝ์ œ์  ์ œํ•œ

๋ณด๋‹ค ๋น ๋ฅธ ๋„คํŠธ์›Œํฌ, ๋ถ„์‚ฐ ์‹œ์Šคํ…œ, ๋‹ค์ค‘ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ์•„ํ‚คํ…์ฒ˜์˜ ๋“ฑ์žฅ ๏ƒจ ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ํ™˜๊ฒฝ

์ƒ๋Œ€์ ์œผ๋กœ ๊ฐ’์‹ผ ํ”„๋กœ์„ธ์„œ๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ ๋ฌถ์–ด ๋™์‹œ์— ์‚ฌ์šฉํ•จ
์œผ๋กœ์จ ์›ํ•˜๋Š” ์„ฑ๋Šฅ์ด๋“ ๊ธฐ๋Œ€
Programs and Processes

A process is an executable program - stored as a file on secondary storage - that has been loaded into memory and placed under the execution control of the operating system (kernel).
- Program: stored on secondary storage
- Process: a program being executed by the computer system
- Task = process
Processes

A process is the unit of resource allocation for program execution; one program may run as several processes.

A single-processor system supporting multiple processes
- Wasted resource allocation and overhead from context switching
- Context switching
  - At any instant, only one process runs on a given processor
  - Save the state of the current process -> load the state of another process

Processes are the unit of work distribution in the distributed-memory parallel programming model.
Threads

A thread isolates just the execution aspect of a process.
- Process = execution units (threads) + execution environment (shared resources)
- A process can contain several threads
- Threads in the same process share that process's execution environment

A single-processor system supporting multiple threads
- More efficient resource allocation than multiple processes
- More efficient context switching than multiple processes

Threads are the unit of work distribution in the shared-memory parallel programming model.
Processes and Threads

[Figure: three processes with one thread each vs. one process with three threads.]
Types of Parallelism

Data parallelism
- Domain decomposition
- Each task performs the same series of computations on different data

Task parallelism
- Functional decomposition
- Each task performs different computations on the same or different data
๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (1/3)

๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ : ๋„๋ฉ”์ธ ๋ถ„ํ•ด

Problem Data Set

Task 1

Task 2

Task 3

Task 4
๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (2/3)

์ฝ”๋“œ ์˜ˆ) : ํ–‰๋ ฌ์˜ ๊ณฑ์…ˆ (OpenMP)

Serial Code

Parallel Code
!$OMP PARALLEL DO

DO K=1,N

DO K=1,N

DO J=1,N

DO J=1,N

DO I=1,N
C(I,J) = C(I,J) +

DO I=1,N
C(I,J) = C(I,J) +

(A(I,K)*B(K,J))
END DO
END DO
END DO

A(I,K)*B(K,J)
END DO
END DO
END DO
!$OMP END PARALLEL DO
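
For readers working in C, a minimal OpenMP sketch of the same loop nest is shown below. It is an illustration only (the dimension N and row-major arrays are assumptions, not part of the slide); note that it parallelizes the outer i loop so that each thread writes a disjoint set of rows of C.

```c
#include <omp.h>

#define N 512

/* C(i,j) += A(i,k) * B(k,j); each thread handles its own range of i,
   so no two threads update the same element of C. */
void matmul(double A[N][N], double B[N][N], double C[N][N])
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```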
๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (3/3)
๋ฐ์ดํ„ฐ ๋ถ„ํ•ด (ํ”„๋กœ์„ธ์„œ 4๊ฐœ:K=1,20์ผ ๋•Œ)

Process

Proc0
Proc1
Proc2
Proc3

Iterations of K

K =
K =

1:5
6:10

K = 11:15
K = 16:20

Data Elements

A(I,1:5)
B(1:5,J)
A(I,6:10)
B(6:10,J)
A(I,11:15)
B(11:15,J)
A(I,16:20)
B(16:20,J)
ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (1/3)
ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ : ๊ธฐ๋Šฅ์  ๋ถ„ํ•ด

Problem Instruction Set

Task 1

Task 2

Task 3

Task 4
ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (2/3)
์ฝ”๋“œ ์˜ˆ) : (OpenMP)

Serial Code

Parallel Code

PROGRAM MAIN
โ€ฆ
CALL interpolate()
CALL compute_stats()
CALL gen_random_params()
โ€ฆ
END

PROGRAM MAIN
โ€ฆ
!$OMP PARALLEL
!$OMP SECTIONS
CALL interpolate()
!$OMP SECTION
CALL compute_stats()
!$OMP SECTION
CALL gen_random_params()
!$OMP END SECTIONS
!$OMP END PARALLEL
โ€ฆ
END
ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (3/3)
ํƒœ์Šคํฌ ๋ถ„ํ•ด (3๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ์—์„œ ๋™์‹œ ์ˆ˜ํ–‰)

Process

Code

Proc0

CALL interpolate()

Proc1

CALL compute_stats()

Proc2

CALL gen_random_params()
๋ณ‘๋ ฌ ์•„ํ‚คํ…์ฒ˜ (1/2)

Processor Organizations

Single Instruction,
Single Instruction,
Single Data Stream Multiple Data Stream
(SISD)
(SIMD)

Multiple Instruction, Multiple Instruction,
Single Data Stream Multiple Data Stream
(MIMD)
(MISD)

Uniprocessor
Vector
Processor

Shared memory
Array
Processor (tightly coupled)

Distributed memory
(loosely coupled)

Clusters
Symmetric
multiprocessor
(SMP)

Non-uniform
Memory
Access
(NUMA)
๋ณ‘๋ ฌ ์•„ํ‚คํ…์ฒ˜ (2/2)
์ตœ๊ทผ์˜ ๊ณ ์„ฑ๋Šฅ ์‹œ์Šคํ…œ : ๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ง€์›
๏‚ง

์†Œํ”„ํŠธ ์›จ์–ด์  DSM (Distributed Shared Memory) ๊ตฌํ˜„

โ€ข ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ์ง€์›
โ€ข ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ ๋ณ€์ˆ˜ ๊ณต์œ  ์ง€์›
๏‚ง

ํ•˜๋“œ์›จ์–ด์  DSM ๊ตฌํ˜„ : ๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜

โ€ข ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๊ฐ ๋…ธ๋“œ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์œผ๋กœ ๊ตฌ์„ฑ
โ€ข NUMA : ์‚ฌ์šฉ์ž๋“ค์—๊ฒŒ ํ•˜๋‚˜์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜๋กœ ๋ณด์—ฌ์ง
ex) Superdome(HP), Origin 3000(SGI)
โ€ข SMP ํด๋Ÿฌ์Šคํ„ฐ : SMP๋กœ ๊ตฌ์„ฑ๋œ ๋ถ„์‚ฐ ์‹œ์Šคํ…œ์œผ๋กœ ๋ณด์—ฌ์ง
ex) SP(IBM), Beowulf Clusters
๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ
๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ
๏‚ง
๏‚ง
๏‚ง

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์— ์ ํ•ฉ
๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ํ”„๋กœ๊ทธ๋žจ
OpenMP, Pthreads

๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ
๏‚ง
๏‚ง

๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์— ์ ํ•ฉ
MPI, PVM

ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ
๏‚ง
๏‚ง

๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜
OpenMP + MPI
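
As a rough illustration of the hybrid model, the sketch below combines MPI between processes with OpenMP threads inside each process. It is a hypothetical minimal skeleton, not code from the course material; the actual work partitioning is left abstract.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank, nprocs;

    /* Request an MPI thread level that allows OpenMP threads inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each MPI process (typically one per node) spawns a team of threads. */
    #pragma omp parallel
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, nprocs, omp_get_thread_num(), omp_get_num_threads());
    }

    /* Inter-node communication would use MPI calls outside (or funneled
       through the master thread of) the parallel region. */
    MPI_Finalize();
    return 0;
}
```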
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ

Single thread
time

time

S1

Multi-thread
Thread

S1

fork

P1

P2

P1
P2
P3

P3

join

S2
S2

Shared address space

P4
Process

S2

Process

P4
๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ

Serial
time

time

S1

Messagepassing

S1

S1

S1

S1

P1

P1

P2

P3

P4

P2

S2
S2

S2
S2

S2
S2

S2
S2

Process 0

Process 1

Process 2

Process 3

Node 1

Node 2

Node 3

Node 4

P3
P4
S2
S2
Process

Data transmission over the interconnect
ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ

Message-passing

P1

fork

P2

time

time

S1

Thread

S1

P3

Shared
address

fork

P4
join

join

S2
S2

Thread

S2
S2

Shared
address

Process 0

Process 1

Node 1

Node 2
DSM ์‹œ์Šคํ…œ์˜ ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ

time

S1

S1

S1

S1

P1

P2

P3

P4

Message-passing
S2
S2

S2
S2

S2
S2

S2
S2

Process 0

Process 1

Process 2

Process 3

Node 1

Node 2
SPMD and MPMD (1/4)

SPMD (Single Program, Multiple Data)
- One program is executed simultaneously by several processes
- At any instant the processes execute instructions from the same program, but the instructions being executed may be the same or different from process to process (a small rank-branching sketch follows)

MPMD (Multiple Program, Multiple Data)
- An MPMD application consists of several executables
- When the application runs in parallel, each process may execute the same program as another process or a different one
SPMD and MPMD (2/4)

SPMD
[Figure: the same executable a.out runs on Node 1, Node 2, and Node 3.]

SPMD and MPMD (3/4)

MPMD: Master/Worker (Self-Scheduling)
[Figure: a.out runs on Node 1 (master) while b.out runs on Node 2 and Node 3 (workers).]

SPMD and MPMD (4/4)

MPMD: Coupled Analysis
[Figure: different executables a.out, b.out, and c.out run on Node 1, Node 2, and Node 3.]
- Performance measurement
- Factors that affect performance
- Steps for writing a parallel program
Measuring Program Execution Time (1/2)

time
Usage (bash, ksh): $ time [executable]

$ time mpirun -np 4 -machinefile machines ./exmpi.x
real 0m3.59s
user 0m3.16s
sys  0m0.04s

- real = wall-clock time
- user = CPU time used by the program itself and the libraries it calls
- sys  = CPU time used by system calls made on behalf of the program
- user + sys = CPU time
Measuring Program Execution Time (2/2)

Usage (csh): $ time [executable]

$ time testprog
1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w
 (1)    (2)    (3)    (4)      (5)     (6)   (7) (8)

(1) user CPU time (1.15 s)
(2) system CPU time (0.02 s)
(3) real time (0 min 1.76 s)
(4) fraction of real time spent as CPU time (66.4%)
(5) memory usage: shared (15 KB) + unshared (3981 KB)
(6) input (24 blocks) + output (10 blocks)
(7) no page faults
(8) no swaps
Performance Measurement

Quantitative analysis of the gain obtained by parallelization
- Speed-up
- Efficiency
- Cost
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (1/7)
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (Speed-up) : S(n)

S(n) =

์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰์‹œ๊ฐ„
=
๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰์‹œ๊ฐ„(n๊ฐœ ํ”„๋กœ์„ธ์„œ)

ts
tp

๏‚ง

์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์— ๋Œ€ํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์˜ ์„ฑ๋Šฅ์ด๋“ ์ •๋„

๏‚ง

์‹คํ–‰์‹œ๊ฐ„ = Wall-clock time

๏‚ง

์‹คํ–‰์‹œ๊ฐ„์ด 100์ดˆ๊ฐ€ ๊ฑธ๋ฆฌ๋Š” ์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์„ ๋ณ‘๋ ฌํ™” ํ•˜์—ฌ 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 50์ดˆ ๋งŒ์— ์‹คํ–‰
๋˜์—ˆ๋‹ค๋ฉด,
๏ƒจ S(10) =

100
=
50

2
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (2/7)
์ด์ƒ(Ideal) ์„ฑ๋Šฅํ–ฅ์ƒ๋„ : Amdahlโ€Ÿs Law
๏‚ง f : ์ฝ”๋“œ์˜ ์ˆœ์ฐจ๋ถ€๋ถ„ (0 โ‰ค f โ‰ค 1)
๏‚ง tp = fts + (1-f)ts/n

์ˆœ์ฐจ๋ถ€๋ถ„ ์‹คํ–‰์‹œ
๊ฐ„

๋ณ‘๋ ฌ๋ถ€๋ถ„ ์‹คํ–‰์‹œ
๊ฐ„
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (3/7)

ts
(1

fts
Serial section

f )t S

Parallelizable sections

1

2

n-1

n

1
2
n processes

n-1
n

tp

(1 f )t S / n
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (4/7)

๏‚ง S(n) =

ts =
tp

ts
fts + (1-f)ts/n
1

S(n) =

๏‚ง

์ตœ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ ( n ๏ƒ  โˆž )
S(n) =

๏‚ง

f + (1-f)/n

1
f

ํ”„๋กœ์„ธ์„œ์˜ ๊ฐœ์ˆ˜๋ฅผ ์ฆ๊ฐ€ํ•˜๋ฉด, ์ˆœ์ฐจ๋ถ€๋ถ„ ํฌ๊ธฐ์˜ ์—ญ์ˆ˜์— ์ˆ˜๋ ด
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (5/7)
f = 0.2, n = 4

Serial
Parallel
process 1

20

20

80

20

process 2
process 3

cannot be parallelized

process 4

can be parallelized

S(4) =

1
0.2 + (1-0.2)/4

= 2.5
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (6/7)
ํ”„๋กœ์„ธ์„œ ๊ฐœ์ˆ˜ ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„

f=0

24

Speed-up

20

16

f=0.05

12

f=0.1

8

f=0.2

4

0
0

4

8

12

16

20

number of processors, n

24
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (7/7)
์ˆœ์ฐจ๋ถ€๋ถ„ ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„

16
14

Speed-up

12

n=256

10
8
6

n=16

4
2
0
0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Serial fraction, f

0.8

0.9

1
Efficiency

Efficiency: E(n)

E(n) = ts / (tp x n) = S(n) / n

- Indicates how efficiently the processors are used as their number grows
  - 2x speed-up with 10 processors:   S(10) = 2   -> E(10) = 20 %
  - 10x speed-up with 100 processors: S(100) = 10 -> E(100) = 10 %
Cost

Cost = execution time x number of processors
- Serial program:   Cost = ts
- Parallel program: Cost = tp x n = ts*n / S(n) = ts / E(n)

Example: 2x speed-up with 10 processors, 10x speed-up with 100 processors

ts  | tp | n   | S(n) | E(n) | Cost
100 | 50 | 10  | 2    | 0.2  | 500
100 | 10 | 100 | 10   | 0.1  | 1000
์‹ค์งˆ์  ์„ฑ๋Šฅํ–ฅ์ƒ์— ๊ณ ๋ คํ•  ์‚ฌํ•ญ
์‹ค์ œ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ : ํ†ต์‹ ๋ถ€ํ•˜, ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ ๋ฌธ์ œ
20

80

Serial
parallel

20

20

process 1

cannot be parallelized

process 2

can be parallelized

process 3

communication overhead

process 4

Load unbalance
์„ฑ๋Šฅ์ฆ๊ฐ€๋ฅผ ์œ„ํ•œ ๋ฐฉ์•ˆ๋“ค

1.

ํ”„๋กœ๊ทธ๋žจ์—์„œ ๋ณ‘๋ ฌํ™” ๊ฐ€๋Šฅํ•œ ๋ถ€๋ถ„(Coverage) ์ฆ๊ฐ€
๏‚ง ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐœ์„ 

2.

์ž‘์—…๋ถ€ํ•˜์˜ ๊ท ๋“ฑ ๋ถ„๋ฐฐ : ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ

3.

ํ†ต์‹ ์— ์†Œ๋น„ํ•˜๋Š” ์‹œ๊ฐ„(ํ†ต์‹ ๋ถ€ํ•˜) ๊ฐ์†Œ
์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ์š”์ธ๋“ค

Coverage : Amdahlโ€™s Law
๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ
๋™๊ธฐํ™”
ํ†ต์‹ ๋ถ€ํ•˜

์„ธ๋ถ„์„ฑ
์ž…์ถœ๋ ฅ
Load Balancing

Distribute the work so that all processes are busy for roughly the same amount of time, minimizing waiting.
- Choose the data distribution scheme (block, cyclic, block-cyclic) carefully
- Very important when heterogeneous systems are connected together
- Can also be obtained through dynamic work assignment

[Figure: tasks 0-3 with unequal work; the lightly loaded tasks spend the remaining time waiting.]
Synchronization

Coordination that brings the state or data of parallel tasks into agreement
- A major source of parallel overhead: hurts performance
- Implemented with barriers, locks, semaphores, synchronous communication operations, etc.

Parallel overhead
- Overhead caused by starting, terminating, and coordinating parallel tasks
  - Start: task identification, processor assignment, task loading, data loading, etc.
  - Termination: collecting and sending results, returning operating-system resources, etc.
  - Coordination: synchronization, communication, etc.
Communication Overhead (1/4)

Overhead caused by data communication
- The network has its own latency and bandwidth
- Particularly important for message passing

Factors that affect communication overhead
- Synchronous or asynchronous communication?
- Blocking or non-blocking?
- Point-to-point or collective communication?
- Number of transfers and size of the data transferred
Communication Overhead (2/4)

communication time = latency + message size / bandwidth

- Latency: time for the first bit of the message to arrive
  - send latency + receive latency + propagation delay
- Bandwidth: amount of data that can be communicated per unit time (MB/sec)

effective bandwidth = message size / communication time = bandwidth / (1 + latency x bandwidth / message size)
Communication Overhead (3/4)

[Figure: communication time versus message size - a straight line whose intercept is the latency and whose slope is 1/bandwidth.]
Communication Overhead (4/4)

[Figure: effective bandwidth versus message size for latency = 22 microseconds and bandwidth = 133 MB/sec; effective bandwidth approaches the network bandwidth only for large messages.]
Granularity (1/2)

The ratio of computation time to communication time within a parallel program
- Fine-grained parallelism
  - Relatively little computation between communication or synchronization events
  - Favorable for load balancing
- Coarse-grained parallelism
  - Relatively much computation between communication or synchronization events
  - Unfavorable for load balancing

In general, coarse-grained parallelism is better for performance.
- In fine-grained code, the computation time can fall below the communication or synchronization time
- The trade-off depends on the algorithm and the hardware environment
Granularity (2/2)

[Figure: (a) fine-grained - short bursts of computation separated by frequent communication; (b) coarse-grained - long stretches of computation with occasional communication.]
Input/Output

I/O generally hinders parallelism.
- Writing: overlapping writes when the same file space is shared
- Reading: performance of the file server handling multiple read requests
- I/O that crosses the network (NFS, non-local) becomes a bottleneck

Reduce I/O whenever possible.
- Restrict I/O to specific serial regions
- Perform I/O in local file space

Parallel file systems have been developed (GPFS, PVFS, PPFS, ...)
Parallel I/O programming interfaces have been developed (MPI-2: MPI I/O)
Scalability (1/2)

The ability to keep gaining performance as the environment is scaled up
- Hardware scalability
- Algorithmic scalability

Main hardware factors that affect scalability
- CPU-memory bus bandwidth
- Network bandwidth
- Memory capacity
- Processor clock speed

Scalability (2/2)

[Figure: speed-up versus number of workers.]
์˜์กด์„ฑ๊ณผ ๊ต์ฐฉ
๋ฐ์ดํ„ฐ ์˜์กด์„ฑ : ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰ ์ˆœ์„œ๊ฐ€ ์‹คํ–‰ ๊ฒฐ๊ณผ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๊ฒƒ

DO k = 1, 100
F(k + 2) = F(k +1) + F(k)
ENDDO
๊ต์ฐฉ : ๋‘˜ ์ด์ƒ์˜ ํ”„๋กœ์„ธ์Šค๋“ค์ด ์„œ๋กœ ์ƒ๋Œ€๋ฐฉ์˜ ์ด๋ฒคํŠธ ๋ฐœ์ƒ์„ ๊ธฐ๋‹ค๋ฆฌ๋Š” ์ƒํƒœ

Process 1
X = 4
SOURCE = TASK2
RECEIVE (SOURCE,Y)
DEST = TASK2
SEND (DEST,X)
Z = X + Y

Process 2
Y = 8
SOURCE = TASK1
RECEIVE (SOURCE,X)
DEST = TASK1
SEND (DEST,Y)
Z = X + Y
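
One common way to avoid the blocking-receive deadlock above is a combined send/receive. The sketch below is illustrative only (two ranks exchanging one integer each), not code from the slides.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, other, sendval, recvval;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other   = 1 - rank;              /* assumes exactly two ranks */
    sendval = (rank == 0) ? 4 : 8;

    /* MPI_Sendrecv posts the send and the receive together, so neither
       rank blocks waiting for the other to send first. */
    MPI_Sendrecv(&sendval, 1, MPI_INT, other, 0,
                 &recvval, 1, MPI_INT, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: Z = %d\n", rank, sendval + recvval);
    MPI_Finalize();
    return 0;
}
```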
Dependency

DO k = 1, 100
  F(k+2) = F(k+1) + F(k)
ENDDO

[Figure: computed serially, F(1..n) yields 1, 2, 3, 5, 8, 13, 21, ...; if the loop iterations run in parallel without respecting the dependency, later elements are computed from stale values (e.g. 1, 2, 3, 5(4), 7, 11, 18, ...), so the results differ.]
Steps for Writing a Parallel Program

1. Write, analyze (profile), and optimize the serial code
   - identify hotspots, bottlenecks, data dependencies, etc.
   - data parallelism or task parallelism?
2. Develop the parallel code
   - MPI / OpenMP / ... ?
   - add code for task assignment and control, communication, and synchronization
3. Compile, run, debug
4. Optimize the parallel code
   - improve performance through measurement and analysis
Debugging and Performance Analysis

Debugging
- Take a modular approach when writing the code
- Watch out for communication, synchronization, data dependencies, deadlock, etc.
- Debugger: TotalView

Performance measurement and analysis
- Use timer functions
- Profilers: prof, gprof, pgprof, TAU
Coffee break
II. Parallel Programming using OpenMP
OpenMP๋ž€ ๋ฌด์—‡์ธ๊ฐ€?

๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ํ™˜๊ฒฝ์—์„œ
๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ์ž‘์„ฑ์„ ์œ„ํ•œ
์‘์šฉํ”„๋กœ๊ทธ๋žจ ์ธํ„ฐํŽ˜์ด์Šค(API)
OpenMP์˜ ์—ญ์‚ฌ
1990๋…„๋Œ€ :
๏‚ง ๊ณ ์„ฑ๋Šฅ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๋ฐœ์ „
๏‚ง ์—…์ฒด ๊ณ ์œ ์˜ ์ง€์‹œ์–ด ์ง‘ํ•ฉ ์‚ฌ์šฉ ๏ƒ  ํ‘œ์ค€ํ™”์˜ ํ•„์š”์„ฑ

1994๋…„ ANSI X3H5 ๏ƒ  1996๋…„ openmp.org ์„ค๋ฆฝ
1997๋…„ OpenMP API ๋ฐœํ‘œ
Release History
๏‚ง OpenMP Fortran API ๋ฒ„์ „ 1.0 : 1997๋…„ 10์›”
๏‚ง C/C++ API ๋ฒ„์ „ 1.0 : 1998๋…„ 10์›”
๏‚ง Fortran API ๋ฒ„์ „ 1.1 : 1999๋…„ 11์›”
๏‚ง Fortran API ๋ฒ„์ „ 2.0 : 2000๋…„ 11์›”
๏‚ง C/C++ API ๋ฒ„์ „ 2.0 : 2002๋…„ 3์›”
๏‚ง Combined C/C++ and Fortran API ๋ฒ„์ „ 2.5 : 2005๋…„ 5์›”
๏‚ง API ๋ฒ„์ „ 3.0 : 2008๋…„ 5์›”
OpenMP์˜ ๋ชฉํ‘œ

ํ‘œ์ค€๊ณผ ์ด์‹์„ฑ
๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ‘œ์ค€
๋Œ€๋ถ€๋ถ„์˜ Unix์™€ Windows์— OpenMP ์ปดํŒŒ์ผ๋Ÿฌ ์กด์žฌ
Fortran, C/C++ ์ง€์›
OpenMP์˜ ๊ตฌ์„ฑ (1/2)

Directives

Runtime
Library

Environment
Variables
OpenMP์˜ ๊ตฌ์„ฑ (2/2)
์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด
๏‚ง

์Šค๋ ˆ๋“œ ์‚ฌ์ด์˜ ์ž‘์—…๋ถ„๋‹ด, ํ†ต์‹ , ๋™๊ธฐํ™”๋ฅผ ๋‹ด๋‹น

๏‚ง

์ข์€ ์˜๋ฏธ์˜ OpenMP

์˜ˆ) C$OMP PARALLEL DO
์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
๏‚ง

๋ณ‘๋ ฌ ๋งค๊ฐœ๋ณ€์ˆ˜(์ฐธ์—ฌ ์Šค๋ ˆ๋“œ์˜ ๊ฐœ์ˆ˜, ๋ฒˆํ˜ธ ๋“ฑ)์˜ ์„ค์ •๊ณผ ์กฐํšŒ

์˜ˆ) CALL omp_set_num_threads(128)
ํ™˜๊ฒฝ๋ณ€์ˆ˜
๏‚ง

์‹คํ–‰ ์‹œ์Šคํ…œ์˜ ๋ณ‘๋ ฌ ๋งค๊ฐœ๋ณ€์ˆ˜(์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ๋“ฑ)๋ฅผ ์ •์˜

์˜ˆ) export OMP_NUM_THREADS=8
OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (1/4)
์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ๊ธฐ๋ฐ˜
๏‚ง

์ˆœ์ฐจ์ฝ”๋“œ์˜ ์ ์ ˆํ•œ ์œ„์น˜์— ์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ์‚ฝ์ž…

๏‚ง

์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ์ง€์‹œ์–ด๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์ฝ”๋“œ ์ƒ์„ฑ

๏‚ง

OpenMP๋ฅผ ์ง€์›ํ•˜๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ํ•„์š”

๏‚ง

๋™๊ธฐํ™”, ์˜์กด์„ฑ ์ œ๊ฑฐ ๋“ฑ์˜ ์ž‘์—… ํ•„์š”
OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (2/4)
Fork-Join
๏‚ง
๏‚ง

๋ณ‘๋ ฌํ™”๊ฐ€ ํ•„์š”ํ•œ ๋ถ€๋ถ„์— ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์ƒ์„ฑ
๋ณ‘๋ ฌ๊ณ„์‚ฐ์„ ๋งˆ์น˜๋ฉด ๋‹ค์‹œ ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰

F

J

F

J

O

O

O

O

Master

R

I

R

I

Thread

K

N

K

N

[Parallel Region]

[Parallel Region]
OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (3/4)
์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ์‚ฝ์ž…

Serial Code
PROGRAM exam
โ€ฆ
ialpha = 2
DO i = 1, 100
a(i) = a(i) + ialpha*b(i)
ENDDO
PRINT *, a
END

Parallel Code
PROGRAM exam
โ€ฆ
ialpha = 2
!$OMP PARALLEL DO
DO i = 1, 100
a(i) = a(i) + ialpha*b(i)
ENDDO
!$OMP END PARALLEL DO
PRINT *, a
END
OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (4/4)

Fork-Join
โ€ป export OMP_NUM_THREADS = 4

ialpha = 2

(Master Thread)

(Fork)
DO i=1,25

DO i=26,50

DO i=51,75

DO i=76,100

...

...

...

...

(Join)

(Master)

PRINT *, a

(Slave)

(Master Thread)

(Slave)

(Slave)
OpenMP์˜ ์žฅ์ ๊ณผ ๋‹จ์ 

์žฅ ์ 
๏‚– MPI๋ณด๋‹ค ์ฝ”๋”ฉ, ๋””๋ฒ„๊น…์ด ์‰ฌ์›€
๏‚– ๋ฐ์ดํ„ฐ ๋ถ„๋ฐฐ๊ฐ€ ์ˆ˜์›”

๋‹จ ์ 
โ€ข ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌํ™˜๊ฒฝ์˜ ๋‹ค์ค‘ ํ”„๋กœ์„ธ์„œ
์•„ํ‚คํ…์ฒ˜์—์„œ๋งŒ ๊ตฌํ˜„ ๊ฐ€๋Šฅ

๏‚– ์ ์ง„์  ๋ณ‘๋ ฌํ™”๊ฐ€ ๊ฐ€๋Šฅ

โ€ข OpenMP๋ฅผ ์ง€์›ํ•˜๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ํ•„์š”

๏‚– ํ•˜๋‚˜์˜ ์ฝ”๋“œ๋ฅผ ๋ณ‘๋ ฌ์ฝ”๋“œ์™€ ์ˆœ์ฐจ์ฝ”

โ€ข ๋ฃจํ”„์— ๋Œ€ํ•œ ์˜์กด๋„๊ฐ€ ํผ ๏ƒ  ๋‚ฎ์€

๋“œ๋กœ ์ปดํŒŒ์ผ ๊ฐ€๋Šฅ
๏‚– ์ƒ๋Œ€์ ์œผ๋กœ ์ฝ”๋“œ ํฌ๊ธฐ๊ฐ€ ์ž‘์Œ

๋ณ‘๋ ฌํ™” ํšจ์œจ์„ฑ
โ€ข ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์˜ ํ™•์žฅ์„ฑ
(ํ”„๋กœ์„ธ์„œ ์ˆ˜, ๋ฉ”๋ชจ๋ฆฌ ๋“ฑ) ํ•œ๊ณ„
OpenMP์˜ ์ „ํ˜•์  ์‚ฌ์šฉ
๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ์„ ์ด์šฉํ•œ ๋ฃจํ”„์˜ ๋ณ‘๋ ฌํ™”
1. ์‹œ๊ฐ„์ด ๋งŽ์ด ๊ฑธ๋ฆฌ๋Š” ๋ฃจํ”„๋ฅผ ์ฐพ์Œ (ํ”„๋กœํŒŒ์ผ๋ง)
2. ์˜์กด์„ฑ, ๋ฐ์ดํ„ฐ ์œ ํšจ๋ฒ”์œ„ ์กฐ์‚ฌ
3. ์ง€์‹œ์–ด ์‚ฝ์ž…์œผ๋กœ ๋ณ‘๋ ฌํ™”
ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ์„ ์ด์šฉํ•œ ๋ณ‘๋ ฌํ™”๋„ ๊ฐ€๋Šฅ
Directives (1/5)

OpenMP directive syntax

                          Fortran (fixed form: f77)     Fortran (free form: f90)   C
Directive sentinel        !$OMP <directive>             !$OMP <directive>          #pragma omp <directive>
                          C$OMP <directive>
                          *$OMP <directive>
Line continuation         !$OMP <directive> ...         !$OMP <directive> &        #pragma omp ... \
                          !$OMP& ...                    ...                        ...
Conditional compilation   !$ ...   C$ ...   *$ ...      !$ ...                     #ifdef _OPENMP
Starting column           column 1                      any                        any
Directives (2/5)

Parallel region directive
- PARALLEL / END PARALLEL
- Marks a block of code as a parallel region
- The region is executed simultaneously by multiple threads

Work-sharing directive
- DO / FOR
- Used inside a parallel region
- Assigns loop iterations to the threads based on the loop index

Combined parallel work-sharing directive
- PARALLEL DO / FOR
- Performs the roles of PARALLEL and DO/FOR together
Directives (3/5)

Specifying a parallel region

Fortran
!$OMP PARALLEL
DO i = 1, 10
  PRINT *, 'Hello World', i
ENDDO
!$OMP END PARALLEL

C
#pragma omp parallel
for(i=1; i<=10; i++)
  printf("Hello World %d\n", i);
Directives (4/5)

Parallel region and work sharing

Fortran
!$OMP PARALLEL
!$OMP DO
DO i = 1, 10
  PRINT *, 'Hello World', i
ENDDO
[!$OMP END DO]
!$OMP END PARALLEL

C
#pragma omp parallel
{
  #pragma omp for
  for(i=1; i<=10; i++)
    printf("Hello World %d\n", i);
}
Directives (5/5)

Parallel region and work sharing

Fortran
!$OMP PARALLEL
!$OMP DO
DO i = 1, n
  a(i) = b(i) + c(i)
ENDDO
[!$OMP END DO]      <- optional
!$OMP DO
...
[!$OMP END DO]
!$OMP END PARALLEL

C
#pragma omp parallel
{
  #pragma omp for
  for (i=1; i<=n; i++) {
    a[i] = b[i] + c[i];
  }
  #pragma omp for
  for(...){
    ...
  }
}
์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ ํ™˜๊ฒฝ๋ณ€์ˆ˜ (1/3)
์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
๏‚ง
๏‚ง
๏‚ง

omp_set_num_threads(integer) : ์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ์ง€์ •
omp_get_num_threads() : ์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ๋ฆฌํ„ด
omp_get_thread_num() : ์Šค๋ ˆ๋“œ ID ๋ฆฌํ„ด

ํ™˜๊ฒฝ๋ณ€์ˆ˜
๏‚ง

OMP_NUM_THREADS : ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์Šค๋ ˆ๋“œ ์ตœ๋Œ€ ๊ฐœ์ˆ˜

โ€ข export OMP_NUM_THREADS=16 (ksh)
โ€ข setenv OMP_NUM_THREADS 16 (csh)
C : #include <omp.h>
์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ ํ™˜๊ฒฝ๋ณ€์ˆ˜ (3/3)
omp_set_num_threads
omp_get_thread_num

INTEGER OMP_GET_THREAD_NUM

CALL OMP_SET_NUM_THREADS(4)

Fortran

!$OMP PARALLEL
PRINT*, โ€ฒThread rank: โ€ฒ, OMP_GET_THREAD_NUM()
!$OMP END PARALLEL

#include <omp.h>
omp_set_num_threads(4);

C

#pragma omp parallel
{
printf(โ€ณThread rank:%d๏ผผnโ€ณ,omp_get_thread_num());

}
Main Clauses

private(var1, var2, ...)
shared(var1, var2, ...)
default(shared|private|none)
firstprivate(var1, var2, ...)
lastprivate(var1, var2, ...)
reduction(operator|intrinsic:var1, var2, ...)
schedule(type [,chunk])

(A small C example combining several of these clauses follows.)
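
As an illustration of how these clauses combine (not taken from the slides; the array and variable names are hypothetical):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], sum = 0.0;
    double tmp;                        /* scratch value, private per thread */
    int i;

    for (i = 0; i < N; i++) a[i] = i;

    /* a is shared, tmp is private, partial sums are combined with reduction,
       and iterations are handed out statically in chunks of 100. */
    #pragma omp parallel for default(none) shared(a) private(tmp) \
            reduction(+:sum) schedule(static, 100)
    for (i = 0; i < N; i++) {
        tmp = a[i] * a[i];
        sum += tmp;
    }

    printf("sum of squares = %f\n", sum);
    return 0;
}
```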
clause : reduction (1/4)

reduction(operator|intrinsic:var1, var2, ...)
- A reduction variable must be shared
  - Arrays are allowed (Fortran only): deferred-shape and assumed-shape arrays cannot be used
  - In C, only scalar variables are allowed
- A copy is created in each thread, initialized according to the operator (see the table), and the parallel computation proceeds on the copies
- The partial results computed in parallel by the threads are combined into the final result, which is left with the master thread
clause : reduction (2/4)

!$OMP DO reduction(+:sum)
DO i = 1, 100
  sum = sum + x(i)
ENDDO

Thread 0                   Thread 1
sum0 = 0                   sum1 = 0
DO i = 1, 50               DO i = 51, 100
  sum0 = sum0 + x(i)         sum1 = sum1 + x(i)
ENDDO                      ENDDO

          sum = sum0 + sum1
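
The same pattern in C, as a minimal sketch (the array x and its length are assumptions for illustration):

```c
#include <omp.h>
#include <stdio.h>

#define N 100

int main(void)
{
    double x[N], sum = 0.0;

    for (int i = 0; i < N; i++) x[i] = 1.0;

    /* Each thread accumulates a private partial sum initialized to 0;
       the partial sums are added together when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f\n", sum);   /* expected: 100.0 */
    return 0;
}
```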
clause : reduction (3/4)

Reduction operators : Fortran

Operator | Data Types                                | Initial Value
+        | integer, floating point (complex or real) | 0
*        | integer, floating point (complex or real) | 1
-        | integer, floating point (complex or real) | 0
.AND.    | logical                                   | .TRUE.
.OR.     | logical                                   | .FALSE.
.EQV.    | logical                                   | .TRUE.
.NEQV.   | logical                                   | .FALSE.
MAX      | integer, floating point (real only)       | smallest possible value
MIN      | integer, floating point (real only)       | largest possible value
IAND     | integer                                   | all bits on
IOR      | integer                                   | 0
IEOR     | integer                                   | 0
clause : reduction (4/4)

Reduction operators : C

Operator | Data Types              | Initial Value
+        | integer, floating point | 0
*        | integer, floating point | 1
-        | integer, floating point | 0
&        | integer                 | all bits on
|        | integer                 | 0
^        | integer                 | 0
&&       | integer                 | 1
||       | integer                 | 0
Coffee break
III. Parallel Programming using MPI
Current HPC Platforms : COTS-Based Clusters

COTS = Commercial off-the-shelf

[Figure: a typical cluster - login node(s) with access control, file server(s), and compute nodes (e.g. Nehalem, Gulftown) connected by an interconnect.]
Memory Architectures

Shared Memory
- Single address space for all processors
- UMA and NUMA variants

Distributed Memory

[Figure: UMA and NUMA shared-memory organizations versus a distributed-memory organization.]
What is MPI?

MPI = Message Passing Interface

MPI is a specification for the developers and users of message passing libraries. By itself, it is NOT a library - but rather the specification of what such a library should be.

MPI primarily addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another process through cooperative operations on each process.

Simply stated, the goal of the Message Passing Interface is to provide a widely used standard for writing message passing programs. The interface attempts to be:
- Portable
- Efficient
- Practical
- Flexible
What is MPI?

The MPI standard has gone through a number of revisions, with the most recent version being MPI-3.

Interface specifications have been defined for C and Fortran90 language bindings:
- C++ bindings from MPI-1 are removed in MPI-3
- MPI-3 also provides support for Fortran 2003 and 2008 features

Actual MPI library implementations differ in which version and features of the MPI standard they support. Developers/users will need to be aware of this.
Programming Model

Originally, MPI was designed for distributed memory architectures, which were becoming increasingly popular at that time (1980s - early 1990s).

As architecture trends changed, shared memory SMPs were combined over networks, creating hybrid distributed memory / shared memory systems.
Programming Model

MPI implementers adapted their libraries to handle both types of underlying memory architectures seamlessly. They also adapted/developed ways of handling different interconnects and protocols.

Today, MPI runs on virtually any hardware platform:
- Distributed Memory
- Shared Memory
- Hybrid

The programming model clearly remains a distributed memory model, however, regardless of the underlying physical architecture of the machine.
Reasons for Using MPI

Standardization
- MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries.

Portability
- There is little or no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard.

Performance Opportunities
- Vendor implementations should be able to exploit native hardware features to optimize performance.

Functionality
- There are over 440 routines defined in MPI-3, which includes the majority of those in MPI-2 and MPI-1.

Availability
- A variety of implementations are available, both vendor and public domain.
History and Evolution

MPI has resulted from the efforts of numerous individuals and groups that began in 1992.

1980s - early 1990s: Distributed memory parallel computing develops, as do a number of incompatible software tools for writing such programs - usually with tradeoffs between portability, performance, functionality and price. Recognition of the need for a standard arose.

Apr 1992: Workshop on Standards for Message Passing in a Distributed Memory Environment, sponsored by the Center for Research on Parallel Computing, Williamsburg, Virginia. The basic features essential to a standard message passing interface were discussed, and a working group was established to continue the standardization process. A preliminary draft proposal was developed subsequently.
History and Evolution

Nov 1992: Working group meets in Minneapolis. MPI draft proposal (MPI1) from ORNL presented. The group adopts procedures and organization to form the MPI Forum. It eventually comprised about 175 individuals from 40 organizations, including parallel computer vendors, software writers, academia and application scientists.

Nov 1993: Supercomputing 93 conference - draft MPI standard presented.

May 1994: Final version of MPI-1.0 released. MPI-1.0 was followed by versions MPI-1.1 (Jun 1995), MPI-1.2 (Jul 1997) and MPI-1.3 (May 2008).

MPI-2 picked up where the first MPI specification left off, and addressed topics which went far beyond the MPI-1 specification. It was finalized in 1996. MPI-2.1 (Sep 2008) and MPI-2.2 (Sep 2009) followed.

Sep 2012: The MPI-3.0 standard was approved.
History and Evolution

Documentation for all versions of the MPI standard is available at:
- http://www.mpi-forum.org/docs/
A General Structure of the MPI Program

[Figure: general structure of an MPI program - MPI include file, MPI environment initialization, computation and message passing, MPI environment termination.]
A Header File for MPI Routines

Required for all programs that make MPI library calls.

C include file:       #include "mpi.h"
Fortran include file:  include 'mpif.h'

With MPI-3 Fortran, the USE mpi_f08 module is preferred over using the include file shown above.
The Format of MPI Calls

C names are case sensitive; Fortran names are not.

Programs must not declare variables or functions with names beginning with the prefix MPI_ or PMPI_ (profiling interface).

C Binding
Format:     rc = MPI_Xxxxx(parameter, ...)
Example:    rc = MPI_Bsend(&buf, count, type, dest, tag, comm)
Error code: returned as "rc"; MPI_SUCCESS if successful.

Fortran Binding
Format:     CALL MPI_XXXXX(parameter, ..., ierr)
            call mpi_xxxxx(parameter, ..., ierr)
Example:    call MPI_BSEND(buf, count, type, dest, tag, comm, ierr)
Error code: returned as the "ierr" parameter; MPI_SUCCESS if successful.
Communicators and Groups
MPI uses objects called communicators and groups to define which collection of processes
may communicate with each other.
Most MPI routines require you to specify a communicator as an argument.
Communicators and groups will be covered in more detail later. For now, simply use
MPI_COMM_WORLD whenever a communicator is required - it is the predefined
communicator that includes all of your MPI processes.

Rank
Within a communicator, every process has its own unique, integer identifier assigned by the
system when the process initializes. A rank is sometimes also called a โ€œtask IDโ€. Ranks are
contiguous and begin at zero.
Used by the programmer to specify the source and destination of messages. Often used
conditionally by the application to control program execution (if rank = 0 do this / if rank = 1
do that).

Error Handling

Most MPI routines include a return/error code parameter, as described in the "Format of MPI Calls" section above.

However, according to the MPI standard, the default behavior of an MPI call is to abort if there is an error. This means you will probably not be able to capture a return/error code other than MPI_SUCCESS (zero).

The standard does provide a means to override this default error handler. You can also consult the error handling section of the MPI Standard located at http://www.mpi-forum.org/docs/mpi-11-html/node148.html .

The types of errors displayed to the user are implementation dependent.
Environment Management Routines

MPI_Init
- Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI functions, and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent.

C:       MPI_Init(&argc, &argv)
Fortran: MPI_INIT(ierr)

- Input parameters
  - argc : pointer to the number of arguments
  - argv : pointer to the argument vector
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Comm_size
- Returns the total number of MPI processes in the specified communicator, such as MPI_COMM_WORLD. If the communicator is MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application.

C:       MPI_Comm_size(comm, &size)
Fortran: MPI_COMM_SIZE(comm, size, ierr)

- Input parameters
  - comm : communicator (handle)
- Output parameters
  - size : number of processes in the group of comm (integer)
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Comm_rank
- Returns the rank of the calling MPI process within the specified communicator. Initially, each process will be assigned a unique integer rank between 0 and (number of tasks - 1) within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.

C:       MPI_Comm_rank(comm, &rank)
Fortran: MPI_COMM_RANK(comm, rank, ierr)

- Input parameters
  - comm : communicator (handle)
- Output parameters
  - rank : rank of the calling process in the group of comm (integer)
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Finalize
- Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program - no other MPI routines may be called after it.

C:       MPI_Finalize()
Fortran: MPI_FINALIZE(ierr)

- ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Abort
- Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes regardless of the communicator specified.

C:       MPI_Abort(comm, errorcode)
Fortran: MPI_ABORT(comm, errorcode, ierr)

- Input parameters
  - comm : communicator (handle)
  - errorcode : error code to return to the invoking environment
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Get_processor_name
- Returns the processor name. Also returns the length of the name. The buffer for "name" must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into "name" is implementation dependent - it may not be the same as the output of the "hostname" or "host" shell commands.

C:       MPI_Get_processor_name(&name, &resultlength)
Fortran: MPI_GET_PROCESSOR_NAME(name, resultlength, ierr)

- Output parameters
  - name : a unique specifier for the actual (as opposed to virtual) node; must be an array of size at least MPI_MAX_PROCESSOR_NAME
  - resultlen : length (in characters) of the name
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Get_version
- Returns the version (either 1 or 2) and subversion of MPI.

C:       MPI_Get_version(&version, &subversion)
Fortran: MPI_GET_VERSION(version, subversion, ierr)

- Output parameters
  - version : major version of MPI (1 or 2)
  - subversion : minor version of MPI
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Initialized
- Indicates whether MPI_Init has been called - returns flag as either logical true (1) or false (0).

C:       MPI_Initialized(&flag)
Fortran: MPI_INITIALIZED(flag, ierr)

- Output parameters
  - flag : true if MPI_Init has been called and false otherwise
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Wtime
- Returns an elapsed wall clock time in seconds (double precision) on the calling processor.

C:       MPI_Wtime()
Fortran: MPI_WTIME()

- Return value: time in seconds since an arbitrary time in the past.

MPI_Wtick
- Returns the resolution in seconds (double precision) of MPI_Wtime.

C:       MPI_Wtick()
Fortran: MPI_WTICK()

- Return value: time in seconds of the resolution of MPI_Wtime.
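
A small illustration of timing a code region with MPI_Wtime (the work being timed is a placeholder):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();

    /* ... work to be timed goes here ... */

    double t1 = MPI_Wtime();
    printf("elapsed = %f s (clock resolution %g s)\n", t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}
```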
Example: Hello world

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rc;
    rc = MPI_Init(&argc, &argv);
    printf("Hello world.\n");
    rc = MPI_Finalize();
    return 0;
}
Example: Hello world

Compile and execute an MPI program:

$ module load [compiler] [mpi]
$ mpicc hello.c
$ mpirun -np 4 -hostfile [hostfile] ./a.out

Make a hostfile:

ibs0001 slots=2
ibs0002 slots=2
ibs0003 slots=2
ibs0003 slots=2
...
Example : Environment Management Routines

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, len, rc;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        printf("Error starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(hostname, &len);
    printf("Number of tasks= %d My rank= %d Running on %s\n", numtasks, rank, hostname);

    /******* do some work *******/

    rc = MPI_Finalize();
    return 0;
}
Types of Point-to-Point Operations

MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks. One task is performing a send operation and the other task is performing a matching receive operation.

There are different types of send and receive routines used for different purposes.
- Synchronous send
- Blocking send / blocking receive
- Non-blocking send / non-blocking receive
- Buffered send
- Combined send/receive
- "Ready" send

Any type of send routine can be paired with any type of receive routine.

MPI also provides several routines associated with send-receive operations, such as those used to wait for a message's arrival or probe to find out if a message has arrived.
Buffering

In a perfect world, every send operation would be perfectly synchronized with its matching receive. This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync.

Consider the following two cases:
- A send operation occurs 5 seconds before the receive is ready - where is the message while the receive is pending?
- Multiple sends arrive at the same receiving task which can only accept one send at a time - what happens to the messages that are "backing up"?

Buffering

The MPI implementation (not the MPI standard) decides what happens to data in these types of cases. Typically, a system buffer area is reserved to hold data in transit.
Buffering

System buffer space is:
- Opaque to the programmer and managed entirely by the MPI library
- A finite resource that can be easy to exhaust
- Often mysterious and not well documented
- Able to exist on the sending side, the receiving side, or both
- Something that may improve program performance because it allows send-receive operations to be asynchronous
Blocking vs. Non-blocking

Most of the MPI point-to-point routines can be used in either blocking or non-blocking mode.

Blocking
- A blocking send routine will only "return" after it is safe to modify the application buffer (your send data) for reuse. Safe means that modifications will not affect the data intended for the receive task. Safe does not imply that the data was actually received - it may very well be sitting in a system buffer.
- A blocking send can be synchronous, which means there is handshaking occurring with the receive task to confirm a safe send.
- A blocking send can be asynchronous if a system buffer is used to hold the data for eventual delivery to the receive.
- A blocking receive only "returns" after the data has arrived and is ready for use by the program.

Non-blocking
- Non-blocking send and receive routines behave similarly - they will return almost immediately. They do not wait for any communication events to complete, such as message copying from user memory to system buffer space or the actual arrival of the message.
- Non-blocking operations simply "request" the MPI library to perform the operation when it is able. The user cannot predict when that will happen.
- It is unsafe to modify the application buffer (your variable space) until you know for a fact that the requested non-blocking operation was actually performed by the library. There are "wait" routines used to do this.
- Non-blocking communications are primarily used to overlap computation with communication and exploit possible performance gains.
MPI Message Passing Routine Arguments

MPI point-to-point communication routines generally have an argument list that takes one of the following formats:

Blocking send:        MPI_Send(buffer, count, type, dest, tag, comm)
Non-blocking send:    MPI_Isend(buffer, count, type, dest, tag, comm, request)
Blocking receive:     MPI_Recv(buffer, count, type, source, tag, comm, status)
Non-blocking receive: MPI_Irecv(buffer, count, type, source, tag, comm, request)

Buffer
- Program (application) address space that references the data that is to be sent or received. In most cases, this is simply the variable name that is to be sent/received. For C programs, this argument is passed by reference and usually must be prepended with an ampersand: &var1

Data count
- Indicates the number of data elements of a particular type to be sent.
MPI Message Passing Routine Arguments

Data type
- For reasons of portability, MPI predefines its elementary data types. The table below lists those required by the standard.

C Data Type          | Corresponding C type
MPI_CHAR             | signed char
MPI_SHORT            | signed short int
MPI_INT              | signed int
MPI_LONG             | signed long int
MPI_SIGNED_CHAR      | signed char
MPI_UNSIGNED_CHAR    | unsigned char
MPI_UNSIGNED_SHORT   | unsigned short int
MPI_UNSIGNED         | unsigned int
MPI_UNSIGNED_LONG    | unsigned long int
MPI_FLOAT            | float
MPI_DOUBLE           | double
MPI_LONG_DOUBLE      | long double
MPI Message Passing Routine Arguments

Destination
- An argument to send routines that indicates the process where a message should be delivered. Specified as the rank of the receiving process.

Tag
- Arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wild card MPI_ANY_TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers 0 - 32767 can be used as tags, but most implementations allow a much larger range than this.

Communicator
- Indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating new communicators, the predefined communicator MPI_COMM_WORLD is usually used.
MPI Message Passing Routine Arguments

Status
- For a receive operation, indicates the source of the message and the tag of the message.
- In C, this argument is a pointer to the predefined structure MPI_Status (ex. stat.MPI_SOURCE, stat.MPI_TAG).
- In Fortran, it is an integer array of size MPI_STATUS_SIZE (ex. stat(MPI_SOURCE), stat(MPI_TAG)).
- Additionally, the actual number of elements received is obtainable from Status via the MPI_Get_count routine.

Request
- Used by non-blocking send and receive operations.
- Since non-blocking operations may return before the requested system buffer space is obtained, the system issues a unique "request number".
- The programmer uses this system-assigned "handle" later (in a WAIT type routine) to determine completion of the non-blocking operation.
- In C, this argument is a pointer to the predefined structure MPI_Request.
- In Fortran, it is an integer.
Example : Blocking Message Passing Routine (1/2)
#include "mpi.h"
#include <stdio.h>
int main(argc,argv)
int argc;
char *argv[]; {
int numtasks, rank, dest, source, rc, count, tag=1;
char inmsg, outmsg='x';
MPI_Status Stat;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
dest = 1;
source = 1;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
}
else if (rank == 1) {
dest = 0;
source = 0;
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
}

Example : Blocking Message Passing Routine (2/2)

rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
printf("Task %d: Received %d char(s) from task %d with tag %d\n",
       rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);
MPI_Finalize();
return 0;
}
Example : Dead Lock
#include "mpi.h"
#include <stdio.h>
int main(argc,argv)
int argc;
char *argv[]; {
int numtasks, rank, dest, source, rc, count, tag=1;
char inmsg, outmsg='x';
MPI_Status Stat;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
dest = 1;
source = 1;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
}
else if (rank == 1) {
dest = 0;
source = 0;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
}

Example : Non-Blocking Message Passing Routine (1/2)
Nearest neighbor exchange in a ring topology

#include "mpi.h"
#include <stdio.h>
int main(argc,argv)
int argc;
char *argv[]; {
int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2;
MPI_Request reqs[4];
MPI_Status stats[4];
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
prev = rank-1;
next = rank+1;
if (rank == 0) prev = numtasks - 1;
if (rank == (numtasks - 1)) next = 0;

Example : Non-Blocking Message Passing Routine (2/2)

MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);

/* do some work */

MPI_Waitall(4, reqs, stats);
MPI_Finalize();
return 0;
}
Advanced Example : Monte-Carlo Simulation

<Problem>
- Monte Carlo simulation
- Uses random numbers
- PI = 4 x Ac/As (Ac: area inside the circle, As: area of the square)

<Requirement>
- Use N processes (ranks)
- Point-to-point communication

[Figure: a quarter circle of radius r inscribed in a square.]
Advanced Example : Monte-Carlo Simulation for PI

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main() {
    const long num_step=100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0;
    cnt = 0;
    r = 0.0;
    for (i=0; i<num_step; i++) {
        x = rand() / (RAND_MAX+1.0);
        y = rand() / (RAND_MAX+1.0);
        r = sqrt(x*x + y*y);
        if (r<=1) cnt += 1;
    }
    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
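
One possible MPI parallelization of the serial loop above, matching the exercise's requirements (N ranks, point-to-point communication). This is a hedged sketch, not the course's reference solution; in particular the per-rank seeding with srand(rank + 1) is an assumption made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    int rank, nprocs;
    long i, cnt = 0, total;
    double x, y, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(rank + 1);                              /* different stream per rank */
    for (i = rank; i < num_step; i += nprocs) {   /* cyclic work split */
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        if (sqrt(x * x + y * y) <= 1.0) cnt++;
    }

    if (rank != 0) {                              /* workers send their counts */
        MPI_Send(&cnt, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    } else {                                      /* rank 0 collects with p2p receives */
        total = cnt;
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(&cnt, 1, MPI_LONG, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += cnt;
        }
        pi = 4.0 * (double)total / (double)num_step;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }

    MPI_Finalize();
    return 0;
}
```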
Advanced Example : Numerical integration for PI

<Problem>
- Compute PI by numerical integration:

  integral from 0 to 1 of 4/(1+x^2) dx = PI
  ~= (1/n) * sum over i = 1..n of 4 / (1 + ((i-0.5)/n)^2)

  The interval [0,1] is divided into n strips of width 1/n, and the integrand is evaluated at the midpoint x_i = (i-0.5)/n of each strip.

<Requirement>
- Point-to-point communication
Advanced Example : Numerical integration for PI

#include <stdio.h>
#include <math.h>

int main() {
    const long num_step=100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0/(double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");
    for (i=0; i<num_step; i++) {
        x = ((double)i + 0.5) * step;   /* midpoint of the i-th strip (0-based i) */
        sum += 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("PI = %5lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
Type of Collective Operations

Synchronization
- Processes wait until all members of the group have reached the synchronization point.

Data Movement
- broadcast, scatter/gather, all to all.

Collective Computation (reductions)
- One member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
Programming Considerations and Restrictions

With MPI-3, collective operations can be blocking or non-blocking. Only blocking operations are covered in this tutorial.

Collective communication routines do not take message tag arguments.

Collective operations within a subset of processes are accomplished by first partitioning the subset into new groups and then attaching the new groups to new communicators.

Can only be used with MPI predefined datatypes - not with MPI Derived Data Types.

MPI-2 extended most collective operations to allow data movement between intercommunicators (not covered here).
Collective Communication Routines

MPI_Barrier
- Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.

C:       MPI_Barrier(comm)
Fortran: MPI_BARRIER(comm, ierr)
Collective Communication Routines

MPI_Bcast
- Data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.

C:       MPI_Bcast(&buffer, count, datatype, root, comm)
Fortran: MPI_BCAST(buffer, count, datatype, root, comm, ierr)
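
A minimal usage sketch (the buffer contents are illustrative): rank 0 fills an array and broadcasts it to every rank in MPI_COMM_WORLD.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data[4] = {0, 0, 0, 0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)                      /* only the root fills the buffer */
        for (int i = 0; i < 4; i++) data[i] = i + 1;

    /* After the call, every rank's copy of data holds {1,2,3,4}. */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: %d %d %d %d\n", rank, data[0], data[1], data[2], data[3]);
    MPI_Finalize();
    return 0;
}
```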
Collective Communication Routines

MPI_Scatter
- Data movement operation. Distributes distinct messages from a single source task to each task in the group.

C:       MPI_Scatter(&sendbuf, sendcnt, sendtype, &recvbuf, recvcnt, recvtype, root, comm)
Fortran: MPI_SCATTER(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm, ierr)
Collective Communication Routines

MPI_Gather
- Data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.

C:       MPI_Gather(&sendbuf, sendcnt, sendtype, &recvbuf, recvcount, recvtype, root, comm)
Fortran: MPI_GATHER(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm, ierr)
Collective Communication Routines

MPI_Allgather
- Data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.

C:       MPI_Allgather(&sendbuf, sendcount, sendtype, &recvbuf, recvcount, recvtype, comm)
Fortran: MPI_ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
Collective Communication Routines

MPI_Reduce
- Collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.

C:       MPI_Reduce(&sendbuf, &recvbuf, count, datatype, op, root, comm)
Fortran: MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
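
A minimal usage sketch: each rank contributes its rank number and rank 0 receives the sum (the payload is purely illustrative).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs, global = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Sum the per-rank values; only the root (rank 0) receives the result. */
    MPI_Reduce(&rank, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", nprocs - 1, global);

    MPI_Finalize();
    return 0;
}
```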
Collective Communication Routines

The predefined MPI reduction operations appear below. Users can also define their own reduction functions by using the MPI_Op_create routine.

MPI Reduction Operation | Meaning                | C Data Types
MPI_MAX                 | maximum                | integer, float
MPI_MIN                 | minimum                | integer, float
MPI_SUM                 | sum                    | integer, float
MPI_PROD                | product                | integer, float
MPI_LAND                | logical AND            | integer
MPI_BAND                | bit-wise AND           | integer, MPI_BYTE
MPI_LOR                 | logical OR             | integer
MPI_BOR                 | bit-wise OR            | integer, MPI_BYTE
MPI_LXOR                | logical XOR            | integer
MPI_BXOR                | bit-wise XOR           | integer, MPI_BYTE
MPI_MAXLOC              | max value and location | float, double and long double
MPI_MINLOC              | min value and location | float, double and long double
Collective Communication Routines

MPI_Allreduce
- Collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.

C:       MPI_Allreduce(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran: MPI_ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm, ierr)
Collective Communication Routines

MPI_Reduce_scatter
- Collective computation operation + data movement. First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.

C:       MPI_Reduce_scatter(&sendbuf, &recvbuf, recvcount, datatype, op, comm)
Fortran: MPI_REDUCE_SCATTER(sendbuf, recvbuf, recvcount, datatype, op, comm, ierr)
Collective Communication Routines

MPI_Alltoall
- Data movement operation. Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.

C:       MPI_Alltoall(&sendbuf, sendcount, sendtype, &recvbuf, recvcnt, recvtype, comm)
Fortran: MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm, ierr)
Collective Communication Routines

MPI_Scan
- Performs a scan operation with respect to a reduction operation across a task group.

C:       MPI_Scan(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran: MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm, ierr)
Collective Communication Routines

[Figure: data movement of the collective operations (for some operator *):
 - broadcast: A on P0 is copied to P0-P3.
 - scatter / gather: P0's A, B, C, D are distributed one element per process, and gathered back.
 - reduce: A*B*C*D is left on P0; allreduce leaves A*B*C*D on every process.
 - allgather: every process ends up with A, B, C, D.
 - scan: P0..P3 receive the prefix results A, A*B, A*B*C, A*B*C*D.
 - alltoall: process Pi sends its j-th element to Pj, effectively transposing the data.
 - reduce-scatter: element-wise reductions Ai*Bi*Ci*Di are computed and the i-th result is left on Pi.]
Example : Collective Communication (1/2)
Perform a scatter operation on the rows of an array
#include "mpi.h"
#include <stdio.h>
#define SIZE 4
int main(argc,argv)
int argc;
char *argv[]; {
int numtasks, rank, sendcount, recvcount, source;
float sendbuf[SIZE][SIZE] = {
{1.0, 2.0, 3.0, 4.0},
{5.0, 6.0, 7.0, 8.0},
{9.0, 10.0, 11.0, 12.0},
{13.0, 14.0, 15.0, 16.0} };
float recvbuf[SIZE];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

Example : Collective Communication (2/2)

if (numtasks == SIZE) {
  source = 1;
  sendcount = SIZE;
  recvcount = SIZE;
  MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount,
              MPI_FLOAT,source,MPI_COMM_WORLD);
  printf("rank= %d Results: %f %f %f %f\n",rank,recvbuf[0],
         recvbuf[1],recvbuf[2],recvbuf[3]);
}
else
  printf("Must specify %d processors. Terminating.\n",SIZE);
MPI_Finalize();
return 0;
}
Advanced Example : Monte-Carlo Simulation for PI

Use the collective communication routines!

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main() {
    const long num_step=100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0;
    cnt = 0;
    r = 0.0;
    for (i=0; i<num_step; i++) {
        x = rand() / (RAND_MAX+1.0);
        y = rand() / (RAND_MAX+1.0);
        r = sqrt(x*x + y*y);
        if (r<=1) cnt += 1;
    }
    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
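
A possible collective-communication version of this exercise, again only a sketch (the per-rank seeding is an assumption): a single MPI_Reduce replaces the explicit send/receive loop of the point-to-point version.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    int rank, nprocs;
    long i, cnt = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(rank + 1);
    for (i = rank; i < num_step; i += nprocs) {
        double x = rand() / (RAND_MAX + 1.0);
        double y = rand() / (RAND_MAX + 1.0);
        if (x * x + y * y <= 1.0) cnt++;
    }

    /* All partial counts are summed onto rank 0 in one collective call. */
    MPI_Reduce(&cnt, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double pi = 4.0 * (double)total / (double)num_step;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }

    MPI_Finalize();
    return 0;
}
```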
Advanced Example : Numerical integration for PI

Use the collective communication routines!

#include <stdio.h>
#include <math.h>

int main() {
    const long num_step=100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0/(double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");
    for (i=0; i<num_step; i++) {
        x = ((double)i + 0.5) * step;   /* midpoint of the i-th strip (0-based i) */
        sum += 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("PI = %5lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
Any questions?

Introduction to Parallel Programming

  • 2. CONTENT S I. Introduction to Parallel Computing II. Parallel Programming using OpenMP III. Parallel Programming using MPI
  • 3. I. Introduction to Parallel Computing
  • 4. ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ (1/3) ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ž€, ์ˆœ์ฐจ์ ์œผ๋กœ ์ง„ํ–‰๋˜๋Š” ๊ณ„์‚ฐ์˜์—ญ์„ ์—ฌ๋Ÿฌ ๊ฐœ๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ๊ฐ์„ ์—ฌ๋Ÿฌ ํ”„๋กœ์„ธ์„œ์—์„œ ๋™์‹œ์— ์ˆ˜ํ–‰ ๋˜๋„๋ก ํ•˜๋Š” ๊ฒƒ
  • 6. ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ (3/3) ์ฃผ๋œ ๋ชฉ์  : ๋”์šฑ ํฐ ๋ฌธ์ œ๋ฅผ ๋”์šฑ ๋นจ๋ฆฌ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ ๏‚ง ํ”„๋กœ๊ทธ๋žจ์˜ wall-clock time ๊ฐ์†Œ ๏‚ง ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ๋ฌธ์ œ์˜ ํฌ๊ธฐ ์ฆ๊ฐ€ ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ๊ณ„์‚ฐ ์ž์› ๏‚ง ์—ฌ๋Ÿฌ ๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ(CPU)๋ฅผ ๊ฐ€์ง€๋Š” ๋‹จ์ผ ์ปดํ“จํ„ฐ ๏‚ง ๋„คํŠธ์›Œํฌ๋กœ ์—ฐ๊ฒฐ๋œ ๋‹ค์ˆ˜์˜ ์ปดํ“จํ„ฐ
  • 7. ์™œ ๋ณ‘๋ ฌ์ธ๊ฐ€? ๊ณ ์„ฑ๋Šฅ ๋‹จ์ผ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ๊ฐœ๋ฐœ์˜ ์ œํ•œ ๏‚ง ์ „์†ก์†๋„์˜ ํ•œ๊ณ„ (๊ตฌ๋ฆฌ์„  : 9 cm/nanosec) ๏‚ง ์†Œํ˜•ํ™”์˜ ํ•œ๊ณ„ ๏‚ง ๊ฒฝ์ œ์  ์ œํ•œ ๋ณด๋‹ค ๋น ๋ฅธ ๋„คํŠธ์›Œํฌ, ๋ถ„์‚ฐ ์‹œ์Šคํ…œ, ๋‹ค์ค‘ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ์•„ํ‚คํ…์ฒ˜์˜ ๋“ฑ์žฅ ๏ƒจ ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ํ™˜๊ฒฝ ์ƒ๋Œ€์ ์œผ๋กœ ๊ฐ’์‹ผ ํ”„๋กœ์„ธ์„œ๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ ๋ฌถ์–ด ๋™์‹œ์— ์‚ฌ์šฉํ•จ ์œผ๋กœ์จ ์›ํ•˜๋Š” ์„ฑ๋Šฅ์ด๋“ ๊ธฐ๋Œ€
  • 8. ํ”„๋กœ๊ทธ๋žจ๊ณผ ํ”„๋กœ์„ธ์Šค ํ”„๋กœ์„ธ์Šค๋Š” ๋ณด์กฐ ๊ธฐ์–ต ์žฅ์น˜์— ํ•˜๋‚˜์˜ ํŒŒ์ผ๋กœ์„œ ์ €์žฅ๋˜์–ด ์žˆ๋˜ ์‹คํ–‰ ๊ฐ€๋Šฅํ•œ ํ”„๋กœ๊ทธ๋žจ์ด ๋กœ๋”ฉ๋˜์–ด ์šด์˜ ์ฒด์ œ(์ปค๋„)์˜ ์‹คํ–‰ ์ œ์–ด ์ƒํƒœ์— ๋†“์ธ ๊ฒƒ ๏‚ง ํ”„๋กœ๊ทธ๋žจ : ๋ณด์กฐ ๊ธฐ์–ต ์žฅ์น˜์— ์ €์žฅ ๏‚ง ํ”„๋กœ์„ธ์Šค : ์ปดํ“จํ„ฐ ์‹œ์Šคํ…œ์— ์˜ํ•˜์—ฌ ์‹คํ–‰ ์ค‘์ธ ํ”„๋กœ๊ทธ๋žจ ๏‚ง ํƒœ์Šคํฌ = ํ”„๋กœ์„ธ์Šค
  • 9. ํ”„๋กœ์„ธ์Šค ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰์„ ์œ„ํ•œ ์ž์› ํ• ๋‹น์˜ ๋‹จ์œ„๊ฐ€ ๋˜๊ณ , ํ•œ ํ”„๋กœ๊ทธ๋žจ์—์„œ ์—ฌ๋Ÿฌ ๊ฐœ ์‹คํ–‰ ๊ฐ€๋Šฅ ๋‹ค์ค‘ ํ”„๋กœ์„ธ์Šค๋ฅผ ์ง€์›ํ•˜๋Š” ๋‹จ์ผ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ๏‚ง ์ž์› ํ• ๋‹น์˜ ๋‚ญ๋น„, ๋ฌธ๋งฅ๊ตํ™˜์œผ๋กœ ์ธํ•œ ๋ถ€ํ•˜ ๋ฐœ์ƒ ๏‚ง ๋ฌธ๋งฅ๊ตํ™˜ โ€ข ์–ด๋–ค ์ˆœ๊ฐ„ ํ•œ ํ”„๋กœ์„ธ์„œ์—์„œ ์‹คํ–‰ ์ค‘์ธ ํ”„๋กœ์„ธ์Šค๋Š” ํ•ญ์ƒ ํ•˜๋‚˜ โ€ข ํ˜„์žฌ ํ”„๋กœ์„ธ์Šค ์ƒํƒœ ์ €์žฅ ๏ƒ  ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์Šค ์ƒํƒœ ์ ์žฌ ๋ถ„์‚ฐ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ์˜ ์ž‘์—…ํ• ๋‹น ๊ธฐ์ค€
  • 10. ์Šค๋ ˆ๋“œ ํ”„๋กœ์„ธ์Šค์—์„œ ์‹คํ–‰์˜ ๊ฐœ๋…๋งŒ์„ ๋ถ„๋ฆฌํ•œ ๊ฒƒ ๏‚ง ํ”„๋กœ์„ธ์Šค = ์‹คํ–‰๋‹จ์œ„(์Šค๋ ˆ๋“œ) + ์‹คํ–‰ํ™˜๊ฒฝ(๊ณต์œ ์ž์›) ๏‚ง ํ•˜๋‚˜์˜ ํ”„๋กœ์„ธ์Šค์— ์—ฌ๋Ÿฌ ๊ฐœ ์กด์žฌ๊ฐ€๋Šฅ ๏‚ง ๊ฐ™์€ ํ”„๋กœ์„ธ์Šค์— ์†ํ•œ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ์™€ ์‹คํ–‰ํ™˜๊ฒฝ์„ ๊ณต์œ  ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ๋ฅผ ์ง€์›ํ•˜๋Š” ๋‹จ์ผ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ๏‚ง ๋‹ค์ค‘ ํ”„๋กœ์„ธ์Šค๋ณด๋‹ค ํšจ์œจ์ ์ธ ์ž์› ํ• ๋‹น ๏‚ง ๋‹ค์ค‘ ํ”„๋กœ์„ธ์Šค๋ณด๋‹ค ํšจ์œจ์ ์ธ ๋ฌธ๋งฅ๊ตํ™˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ์˜ ์ž‘์—…ํ• ๋‹น ๊ธฐ์ค€
  • 11. ํ”„๋กœ์„ธ์Šค์™€ ์Šค๋ ˆ๋“œ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๊ฐ–๋Š” 3๊ฐœ์˜ ํ”„๋กœ์„ธ์Šค ์Šค๋ ˆ๋“œ ํ”„๋กœ์„ธ์Šค 3๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๊ฐ–๋Š” ํ•˜๋‚˜์˜ ํ”„๋กœ์„ธ์Šค
  • 12. ๋ณ‘๋ ฌ์„ฑ ์œ ํ˜• ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (Data Parallelism) ๏‚ง ๋„๋ฉ”์ธ ๋ถ„ํ•ด (Domain Decomposition) ๏‚ง ๊ฐ ํƒœ์Šคํฌ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๋™์ผํ•œ ์ผ๋ จ์˜ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (Task Parallelism) ๏‚ง ๊ธฐ๋Šฅ์  ๋ถ„ํ•ด (Functional Decomposition) ๏‚ง ๊ฐ ํƒœ์Šคํฌ๋Š” ๊ฐ™๊ฑฐ๋‚˜ ๋˜๋Š” ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์„œ๋กœ ๋‹ค๋ฅธ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰
  • 13. ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (1/3) ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ : ๋„๋ฉ”์ธ ๋ถ„ํ•ด Problem Data Set Task 1 Task 2 Task 3 Task 4
  • 14. ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (2/3) ์ฝ”๋“œ ์˜ˆ) : ํ–‰๋ ฌ์˜ ๊ณฑ์…ˆ (OpenMP) Serial Code Parallel Code !$OMP PARALLEL DO DO K=1,N DO K=1,N DO J=1,N DO J=1,N DO I=1,N C(I,J) = C(I,J) + DO I=1,N C(I,J) = C(I,J) + (A(I,K)*B(K,J)) END DO END DO END DO A(I,K)*B(K,J) END DO END DO END DO !$OMP END PARALLEL DO
  • 15. ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (3/3) ๋ฐ์ดํ„ฐ ๋ถ„ํ•ด (ํ”„๋กœ์„ธ์„œ 4๊ฐœ:K=1,20์ผ ๋•Œ) Process Proc0 Proc1 Proc2 Proc3 Iterations of K K = K = 1:5 6:10 K = 11:15 K = 16:20 Data Elements A(I,1:5) B(1:5,J) A(I,6:10) B(6:10,J) A(I,11:15) B(11:15,J) A(I,16:20) B(16:20,J)
  • 16. ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (1/3) ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ : ๊ธฐ๋Šฅ์  ๋ถ„ํ•ด Problem Instruction Set Task 1 Task 2 Task 3 Task 4
  • 17. ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (2/3) ์ฝ”๋“œ ์˜ˆ) : (OpenMP) Serial Code Parallel Code PROGRAM MAIN โ€ฆ CALL interpolate() CALL compute_stats() CALL gen_random_params() โ€ฆ END PROGRAM MAIN โ€ฆ !$OMP PARALLEL !$OMP SECTIONS CALL interpolate() !$OMP SECTION CALL compute_stats() !$OMP SECTION CALL gen_random_params() !$OMP END SECTIONS !$OMP END PARALLEL โ€ฆ END
  • 18. ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (3/3) ํƒœ์Šคํฌ ๋ถ„ํ•ด (3๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ์—์„œ ๋™์‹œ ์ˆ˜ํ–‰) Process Code Proc0 CALL interpolate() Proc1 CALL compute_stats() Proc2 CALL gen_random_params()
  • 19. ๋ณ‘๋ ฌ ์•„ํ‚คํ…์ฒ˜ (1/2) Processor Organizations Single Instruction, Single Instruction, Single Data Stream Multiple Data Stream (SISD) (SIMD) Multiple Instruction, Multiple Instruction, Single Data Stream Multiple Data Stream (MIMD) (MISD) Uniprocessor Vector Processor Shared memory Array Processor (tightly coupled) Distributed memory (loosely coupled) Clusters Symmetric multiprocessor (SMP) Non-uniform Memory Access (NUMA)
  • 20. ๋ณ‘๋ ฌ ์•„ํ‚คํ…์ฒ˜ (2/2) ์ตœ๊ทผ์˜ ๊ณ ์„ฑ๋Šฅ ์‹œ์Šคํ…œ : ๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ง€์› ๏‚ง ์†Œํ”„ํŠธ ์›จ์–ด์  DSM (Distributed Shared Memory) ๊ตฌํ˜„ โ€ข ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ์ง€์› โ€ข ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ ๋ณ€์ˆ˜ ๊ณต์œ  ์ง€์› ๏‚ง ํ•˜๋“œ์›จ์–ด์  DSM ๊ตฌํ˜„ : ๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜ โ€ข ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๊ฐ ๋…ธ๋“œ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์œผ๋กœ ๊ตฌ์„ฑ โ€ข NUMA : ์‚ฌ์šฉ์ž๋“ค์—๊ฒŒ ํ•˜๋‚˜์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜๋กœ ๋ณด์—ฌ์ง ex) Superdome(HP), Origin 3000(SGI) โ€ข SMP ํด๋Ÿฌ์Šคํ„ฐ : SMP๋กœ ๊ตฌ์„ฑ๋œ ๋ถ„์‚ฐ ์‹œ์Šคํ…œ์œผ๋กœ ๋ณด์—ฌ์ง ex) SP(IBM), Beowulf Clusters
  • 21. ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ ๏‚ง ๏‚ง ๏‚ง ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์— ์ ํ•ฉ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ํ”„๋กœ๊ทธ๋žจ OpenMP, Pthreads ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ ๏‚ง ๏‚ง ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์— ์ ํ•ฉ MPI, PVM ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ ๏‚ง ๏‚ง ๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜ OpenMP + MPI
  • 22. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ Single thread time time S1 Multi-thread Thread S1 fork P1 P2 P1 P2 P3 P3 join S2 S2 Shared address space P4 Process S2 Process P4
  • 23. ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ Serial time time S1 Messagepassing S1 S1 S1 S1 P1 P1 P2 P3 P4 P2 S2 S2 S2 S2 S2 S2 S2 S2 Process 0 Process 1 Process 2 Process 3 Node 1 Node 2 Node 3 Node 4 P3 P4 S2 S2 Process Data transmission over the interconnect
  • 24. ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ Message-passing P1 fork P2 time time S1 Thread S1 P3 Shared address fork P4 join join S2 S2 Thread S2 S2 Shared address Process 0 Process 1 Node 1 Node 2
  • 25. DSM ์‹œ์Šคํ…œ์˜ ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ time S1 S1 S1 S1 P1 P2 P3 P4 Message-passing S2 S2 S2 S2 S2 S2 S2 S2 Process 0 Process 1 Process 2 Process 3 Node 1 Node 2
  • 26. SPMD์™€ MPMD (1/4) SPMD(Single Program Multiple Data) ๏‚ง ํ•˜๋‚˜์˜ ํ”„๋กœ๊ทธ๋žจ์ด ์—ฌ๋Ÿฌ ํ”„๋กœ์„ธ์Šค์—์„œ ๋™์‹œ์— ์ˆ˜ํ–‰๋จ ๏‚ง ์–ด๋–ค ์ˆœ๊ฐ„ ํ”„๋กœ์„ธ์Šค๋“ค์€ ๊ฐ™์€ ํ”„๋กœ๊ทธ๋žจ๋‚ด์˜ ๋ช…๋ น์–ด๋“ค์„ ์ˆ˜ํ–‰ํ•˜๋ฉฐ ๊ทธ ๋ช…๋ น์–ด๋“ค์€ ๊ฐ™์„ ์ˆ˜๋„ ๋‹ค๋ฅผ ์ˆ˜๋„ ์žˆ์Œ MPMD (Multiple Program Multiple Data) ๏‚ง ํ•œ MPMD ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์€ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์‹คํ–‰ ํ”„๋กœ๊ทธ๋žจ์œผ๋กœ ๊ตฌ์„ฑ ๏‚ง ์‘์šฉํ”„๋กœ๊ทธ๋žจ์ด ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰๋  ๋•Œ ๊ฐ ํ”„๋กœ์„ธ์Šค๋Š” ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์Šค์™€ ๊ฐ™๊ฑฐ๋‚˜ ๋‹ค๋ฅธ ํ”„๋กœ๊ทธ๋žจ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Œ
  • 28. SPMD์™€ MPMD (3/4) MPMD : Master/Worker (Self-Scheduling) a.out Node 1 b.out Node 2 Node 3
  • 29. SPMD์™€ MPMD (4/4) MPMD: Coupled Analysis a.out b.out c.out Node 1 Node 2 Node 3
  • 30. โ€ข์„ฑ๋Šฅ์ธก์ • โ€ข์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ์š”์ธ๋“ค โ€ข๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ์ž‘์„ฑ์ˆœ์„œ
  • 31. ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰์‹œ๊ฐ„ ์ธก์ • (1/2) time ์‚ฌ์šฉ๋ฐฉ๋ฒ•(bash, ksh) : $time [executable] $ time mpirun โ€“np 4 โ€“machinefile machines ./exmpi.x real 0m3.59s user 0m3.16s sys 0m0.04s ๏‚ง real = wall-clock time ๏‚ง User = ํ”„๋กœ๊ทธ๋žจ ์ž์‹ ๊ณผ ํ˜ธ์ถœ๋œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์‹คํ–‰์— ์‚ฌ์šฉ๋œ CPU ์‹œ๊ฐ„ ๏‚ง Sys = ํ”„๋กœ๊ทธ๋žจ์— ์˜ํ•ด ์‹œ์Šคํ…œ ํ˜ธ์ถœ์— ์‚ฌ์šฉ๋œ CPU ์‹œ๊ฐ„ ๏‚ง user + sys = CPU time
  • 32. ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰์‹œ๊ฐ„ ์ธก์ • (2/2) ์‚ฌ์šฉ๋ฐฉ๋ฒ•(csh) : $time [executable] $ time testprog 1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w โ‘  โ‘ก โ‘ข โ‘ฃ โ‘ค โ‘ฅ โ‘ฆ โ‘ง โ‘  user CPU time (1.15์ดˆ) โ‘ก system CPU time (0.02์ดˆ) โ‘ข real time (0๋ถ„ 1.76์ดˆ) โ‘ฃ real time์—์„œ CPU time์ด ์ฐจ์ง€ํ•˜๋Š” ์ •๋„(66.4%) โ‘ค ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ : Shared (15Kbytes) + Unshared (3981Kbytes) โ‘ฅ ์ž…๋ ฅ(24 ๋ธ”๋ก) + ์ถœ๋ ฅ(10 ๋ธ”๋ก) โ‘ฆ no page faults โ‘ง no swaps
  • 33. ์„ฑ๋Šฅ์ธก์ • ๋ณ‘๋ ฌํ™”๋ฅผ ํ†ตํ•ด ์–ป์–ด์ง„ ์„ฑ๋Šฅ์ด๋“์˜ ์ •๋Ÿ‰์  ๋ถ„์„ ์„ฑ๋Šฅ์ธก์ • ๏‚ง ์„ฑ๋Šฅํ–ฅ์ƒ๋„ ๏‚ง ํšจ์œจ ๏‚ง Cost
  • 34. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (1/7) ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (Speed-up) : S(n) S(n) = ์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰์‹œ๊ฐ„ = ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰์‹œ๊ฐ„(n๊ฐœ ํ”„๋กœ์„ธ์„œ) ts tp ๏‚ง ์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์— ๋Œ€ํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์˜ ์„ฑ๋Šฅ์ด๋“ ์ •๋„ ๏‚ง ์‹คํ–‰์‹œ๊ฐ„ = Wall-clock time ๏‚ง ์‹คํ–‰์‹œ๊ฐ„์ด 100์ดˆ๊ฐ€ ๊ฑธ๋ฆฌ๋Š” ์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์„ ๋ณ‘๋ ฌํ™” ํ•˜์—ฌ 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 50์ดˆ ๋งŒ์— ์‹คํ–‰ ๋˜์—ˆ๋‹ค๋ฉด, ๏ƒจ S(10) = 100 = 50 2
  • 35. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (2/7) ์ด์ƒ(Ideal) ์„ฑ๋Šฅํ–ฅ์ƒ๋„ : Amdahlโ€Ÿs Law ๏‚ง f : ์ฝ”๋“œ์˜ ์ˆœ์ฐจ๋ถ€๋ถ„ (0 โ‰ค f โ‰ค 1) ๏‚ง tp = fts + (1-f)ts/n ์ˆœ์ฐจ๋ถ€๋ถ„ ์‹คํ–‰์‹œ ๊ฐ„ ๋ณ‘๋ ฌ๋ถ€๋ถ„ ์‹คํ–‰์‹œ ๊ฐ„
  • 36. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (3/7) ts (1 fts Serial section f )t S Parallelizable sections 1 2 n-1 n 1 2 n processes n-1 n tp (1 f )t S / n
  • 37. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (4/7) ๏‚ง S(n) = ts = tp ts fts + (1-f)ts/n 1 S(n) = ๏‚ง ์ตœ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ ( n ๏ƒ  โˆž ) S(n) = ๏‚ง f + (1-f)/n 1 f ํ”„๋กœ์„ธ์„œ์˜ ๊ฐœ์ˆ˜๋ฅผ ์ฆ๊ฐ€ํ•˜๋ฉด, ์ˆœ์ฐจ๋ถ€๋ถ„ ํฌ๊ธฐ์˜ ์—ญ์ˆ˜์— ์ˆ˜๋ ด
  • 38. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (5/7) f = 0.2, n = 4 Serial Parallel process 1 20 20 80 20 process 2 process 3 cannot be parallelized process 4 can be parallelized S(4) = 1 0.2 + (1-0.2)/4 = 2.5
  • 39. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (6/7) ํ”„๋กœ์„ธ์„œ ๊ฐœ์ˆ˜ ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ f=0 24 Speed-up 20 16 f=0.05 12 f=0.1 8 f=0.2 4 0 0 4 8 12 16 20 number of processors, n 24
  • 40. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (7/7) ์ˆœ์ฐจ๋ถ€๋ถ„ ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ 16 14 Speed-up 12 n=256 10 8 6 n=16 4 2 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Serial fraction, f 0.8 0.9 1
  • 41. ํšจ์œจ ํšจ์œจ (Efficiency) : E(n) E(n) = ๏‚ง ts = tpโ…นn S(n) n ํ”„๋กœ์„ธ์„œ ๊ฐœ์ˆ˜์— ๋”ฐ๋ฅธ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์˜ ์„ฑ๋Šฅํšจ์œจ์„ ๋‚˜ํƒ€๋ƒ„ โ€ข 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 2๋ฐฐ์˜ ์„ฑ๋Šฅํ–ฅ์ƒ : โ€“ S(10) = 2 ๏ƒ  E(10) = 20 % โ€ข 100๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 10๋ฐฐ์˜ ์„ฑ๋Šฅํ–ฅ์ƒ : โ€“ S(100) = 10 ๏ƒ  E(100) = 10 %
  • 42. Cost Cost Cost = ์‹คํ–‰์‹œ๊ฐ„ โ…น ํ”„๋กœ์„ธ์„œ ๊ฐœ์ˆ˜ ๏‚ง ์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ : Cost = ts ๏‚ง ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ : Cost = tp โ…น n = tsn S(n) = ts E(n) ์˜ˆ) 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 2๋ฐฐ, 100๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 10๋ฐฐ์˜ ์„ฑ๋Šฅํ–ฅ์ƒ ts tp n S(n) E(n) Cost 100 50 10 2 0.2 500 100 10 100 10 0.1 1000
  • 43. ์‹ค์งˆ์  ์„ฑ๋Šฅํ–ฅ์ƒ์— ๊ณ ๋ คํ•  ์‚ฌํ•ญ ์‹ค์ œ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ : ํ†ต์‹ ๋ถ€ํ•˜, ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ ๋ฌธ์ œ 20 80 Serial parallel 20 20 process 1 cannot be parallelized process 2 can be parallelized process 3 communication overhead process 4 Load unbalance
  • 44. ์„ฑ๋Šฅ์ฆ๊ฐ€๋ฅผ ์œ„ํ•œ ๋ฐฉ์•ˆ๋“ค 1. ํ”„๋กœ๊ทธ๋žจ์—์„œ ๋ณ‘๋ ฌํ™” ๊ฐ€๋Šฅํ•œ ๋ถ€๋ถ„(Coverage) ์ฆ๊ฐ€ ๏‚ง ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐœ์„  2. ์ž‘์—…๋ถ€ํ•˜์˜ ๊ท ๋“ฑ ๋ถ„๋ฐฐ : ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ 3. ํ†ต์‹ ์— ์†Œ๋น„ํ•˜๋Š” ์‹œ๊ฐ„(ํ†ต์‹ ๋ถ€ํ•˜) ๊ฐ์†Œ
  • 45. ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ์š”์ธ๋“ค Coverage : Amdahlโ€™s Law ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ ๋™๊ธฐํ™” ํ†ต์‹ ๋ถ€ํ•˜ ์„ธ๋ถ„์„ฑ ์ž…์ถœ๋ ฅ
  • 46. ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ ๋ชจ๋“  ํ”„๋กœ์„ธ์Šค๋“ค์˜ ์ž‘์—…์‹œ๊ฐ„์ด ๊ฐ€๋Šฅํ•œ ๊ท ๋“ฑํ•˜๋„๋ก ์ž‘์—…์„ ๋ถ„๋ฐฐํ•˜์—ฌ ์ž‘์—…๋Œ€๊ธฐ์‹œ๊ฐ„์„ ์ตœ์†Œํ™” ํ•˜๋Š” ๊ฒƒ ๏‚ง ๋ฐ์ดํ„ฐ ๋ถ„๋ฐฐ๋ฐฉ์‹(Block, Cyclic, Block-Cyclic) ์„ ํƒ์— ์ฃผ์˜ ๏‚ง ์ด๊ธฐ์ข… ์‹œ์Šคํ…œ์„ ์—ฐ๊ฒฐ์‹œํ‚จ ๊ฒฝ์šฐ, ๋งค์šฐ ์ค‘์š”ํ•จ ๏‚ง ๋™์  ์ž‘์—…ํ• ๋‹น์„ ํ†ตํ•ด ์–ป์„ ์ˆ˜๋„ ์žˆ์Œ task0 WORK task1 WAIT task2 task3 time
  • 47. ๋™๊ธฐํ™” ๋ณ‘๋ ฌ ํƒœ์Šคํฌ์˜ ์ƒํƒœ๋‚˜ ์ •๋ณด ๋“ฑ์„ ๋™์ผํ•˜๊ฒŒ ์„ค์ •ํ•˜๊ธฐ ์œ„ํ•œ ์กฐ์ •์ž‘์—… ๏‚ง ๏‚ง ๋Œ€ํ‘œ์  ๋ณ‘๋ ฌ๋ถ€ํ•˜ : ์„ฑ๋Šฅ์— ์•…์˜ํ–ฅ ์žฅ๋ฒฝ, ์ž ๊ธˆ, ์„ธ๋งˆํฌ์–ด(semaphore), ๋™๊ธฐํ†ต์‹  ์—ฐ์‚ฐ ๋“ฑ ์ด์šฉ ๋ณ‘๋ ฌ๋ถ€ํ•˜ (Parallel Overhead) ๏‚ง ๋ณ‘๋ ฌ ํƒœ์Šคํฌ์˜ ์‹œ์ž‘, ์ข…๋ฃŒ, ์กฐ์ •์œผ๋กœ ์ธํ•œ ๋ถ€ํ•˜ โ€ข ์‹œ์ž‘ : ํƒœ์Šคํฌ ์‹๋ณ„, ํ”„๋กœ์„ธ์„œ ์ง€์ •, ํƒœ์Šคํฌ ๋กœ๋“œ, ๋ฐ์ดํ„ฐ ๋กœ๋“œ ๋“ฑ โ€ข ์ข…๋ฃŒ : ๊ฒฐ๊ณผ์˜ ์ทจํ•ฉ๊ณผ ์ „์†ก, ์šด์˜์ฒด์ œ ์ž์›์˜ ๋ฐ˜๋‚ฉ ๋“ฑ โ€ข ์กฐ์ • : ๋™๊ธฐํ™”, ํ†ต์‹  ๋“ฑ
  • 48. ํ†ต์‹ ๋ถ€ํ•˜ (1/4) ๋ฐ์ดํ„ฐ ํ†ต์‹ ์— ์˜ํ•ด ๋ฐœ์ƒํ•˜๋Š” ๋ถ€ํ•˜ ๏‚ง ๋„คํŠธ์›Œํฌ ๊ณ ์œ ์˜ ์ง€์—ฐ์‹œ๊ฐ„๊ณผ ๋Œ€์—ญํญ ์กด์žฌ ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ์—์„œ ์ค‘์š” ํ†ต์‹ ๋ถ€ํ•˜์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ์š”์ธ๋“ค ๏‚ง ๋™๊ธฐํ†ต์‹ ? ๋น„๋™๊ธฐ ํ†ต์‹ ? ๏‚ง ๋ธ”๋กํ‚น? ๋…ผ๋ธ”๋กํ‚น? ๏‚ง ์ ๋Œ€์  ํ†ต์‹ ? ์ง‘ํ•ฉํ†ต์‹ ? ๏‚ง ๋ฐ์ดํ„ฐ์ „์†ก ํšŸ์ˆ˜, ์ „์†กํ•˜๋Š” ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ
  • 49. ํ†ต์‹ ๋ถ€ํ•˜ (2/4) ํ†ต์‹ ์‹œ๊ฐ„ = ์ง€์—ฐ์‹œ๊ฐ„ + ๏‚ง ๋ฉ”์‹œ์ง€ ํฌ๊ธฐ ๋Œ€์—ญํญ ์ง€์—ฐ์‹œ๊ฐ„ : ๋ฉ”์‹œ์ง€์˜ ์ฒซ ๋น„ํŠธ๊ฐ€ ์ „์†ก๋˜๋Š”๋ฐ ๊ฑธ๋ฆฌ๋Š” ์‹œ๊ฐ„ โ€ข ์†ก์‹ ์ง€์—ฐ + ์ˆ˜์‹ ์ง€์—ฐ + ์ „๋‹ฌ์ง€์—ฐ ๏‚ง ๋Œ€์—ญํญ : ๋‹จ์œ„์‹œ๊ฐ„๋‹น ํ†ต์‹  ๊ฐ€๋Šฅํ•œ ๋ฐ์ดํ„ฐ์˜ ์–‘(MB/sec) ์œ ํšจ ๋Œ€์—ญํญ = ๋ฉ”์‹œ์ง€ ํฌ๊ธฐ = ํ†ต์‹ ์‹œ๊ฐ„ ๋Œ€์—ญํญ 1+์ง€์—ฐ์‹œ๊ฐ„โ…น๋Œ€์—ญํญ/๋ฉ”์‹œ์ง€ํฌ๊ธฐ
  • 50. ํ†ต์‹ ๋ถ€ํ•˜ (3/4) Communication time Communication Time 1/slope = Bandwidth Latency message size
  • 51. ํ†ต์‹ ๋ถ€ํ•˜ (4/4) Effective Bandwidth effective bandwidth (MB/sec) 1000 network bandwidth 100 10 1 โ€ข latency = 22 ใŽฒ โ€ข bandwidth = 133 MB/sec 0.1 0.01 1 10 100 1000 10000 100000 1000000 message size(bytes)
  • 52. ์„ธ๋ถ„์„ฑ (1/2) ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ๋‚ด์˜ ํ†ต์‹ ์‹œ๊ฐ„์— ๋Œ€ํ•œ ๊ณ„์‚ฐ์‹œ๊ฐ„์˜ ๋น„ ๏‚ง Fine-grained ๋ณ‘๋ ฌ์„ฑ โ€ข ํ†ต์‹  ๋˜๋Š” ๋™๊ธฐํ™” ์‚ฌ์ด์˜ ๊ณ„์‚ฐ์ž‘์—…์ด ์ƒ๋Œ€์ ์œผ๋กœ ์ ์Œ โ€ข ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ์— ์œ ๋ฆฌ ๏‚ง Coarse-grained ๋ณ‘๋ ฌ์„ฑ โ€ข ํ†ต์‹  ๋˜๋Š” ๋™๊ธฐํ™” ์‚ฌ์ด์˜ ๊ณ„์‚ฐ์ž‘์—…์ด ์ƒ๋Œ€์ ์œผ๋กœ ๋งŽ์Œ โ€ข ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ์— ๋ถˆ๋ฆฌ ์ผ๋ฐ˜์ ์œผ๋กœ Coarse-grained ๋ณ‘๋ ฌ์„ฑ์ด ์„ฑ๋Šฅ๋ฉด์—์„œ ์œ ๋ฆฌ ๏‚ง ๊ณ„์‚ฐ์‹œ๊ฐ„ < ํ†ต์‹  ๋˜๋Š” ๋™๊ธฐํ™” ์‹œ๊ฐ„ ๏‚ง ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ํ•˜๋“œ์›จ์–ด ํ™˜๊ฒฝ์— ๋”ฐ๋ผ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ
  • 54. ์ž…์ถœ๋ ฅ ์ผ๋ฐ˜์ ์œผ๋กœ ๋ณ‘๋ ฌ์„ฑ์„ ๋ฐฉํ•ดํ•จ ๏‚ง ์“ฐ๊ธฐ : ๋™์ผ ํŒŒ์ผ๊ณต๊ฐ„์„ ์ด์šฉํ•  ๊ฒฝ์šฐ ๊ฒน์ณ ์“ฐ๊ธฐ ๋ฌธ์ œ ๏‚ง ์ฝ๊ธฐ : ๋‹ค์ค‘ ์ฝ๊ธฐ ์š”์ฒญ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ํŒŒ์ผ์„œ๋ฒ„์˜ ์„ฑ๋Šฅ ๋ฌธ์ œ ๏‚ง ๋„คํŠธ์›Œํฌ๋ฅผ ๊ฒฝ์œ (NFS, non-local)ํ•˜๋Š” ์ž…์ถœ๋ ฅ์˜ ๋ณ‘๋ชฉํ˜„์ƒ ์ž…์ถœ๋ ฅ์„ ๊ฐ€๋Šฅํ•˜๋ฉด ์ค„์ผ ๊ฒƒ ๏‚ง I/O ์ˆ˜ํ–‰์„ ํŠน์ • ์ˆœ์ฐจ์˜์—ญ์œผ๋กœ ์ œํ•œํ•ด ์‚ฌ์šฉ ๏‚ง ์ง€์—ญ์ ์ธ ํŒŒ์ผ๊ณต๊ฐ„์—์„œ I/O ์ˆ˜ํ–‰ ๋ณ‘๋ ฌ ํŒŒ์ผ์‹œ์Šคํ…œ์˜ ๊ฐœ๋ฐœ (GPFS, PVFS, PPFSโ€ฆ) ๋ณ‘๋ ฌ I/O ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ธํ„ฐํŽ˜์ด์Šค ๊ฐœ๋ฐœ (MPI-2 : MPI I/O)
  • 55. ํ™•์žฅ์„ฑ (1/2) ํ™•์žฅ๋œ ํ™˜๊ฒฝ์— ๋Œ€ํ•œ ์„ฑ๋Šฅ์ด๋“์„ ๋ˆ„๋ฆด ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ ๏‚ง ํ•˜๋“œ์›จ์–ด์  ํ™•์žฅ์„ฑ ๏‚ง ์•Œ๊ณ ๋ฆฌ์ฆ˜์  ํ™•์žฅ์„ฑ ํ™•์žฅ์„ฑ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ์ฃผ์š” ํ•˜๋“œ์›จ์–ด์  ์š”์ธ ๏‚ง CPU-๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์Šค ๋Œ€์—ญํญ ๏‚ง ๋„คํŠธ์›Œํฌ ๋Œ€์—ญํญ ๏‚ง ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰ ๏‚ง ํ”„๋กœ์„ธ์„œ ํด๋Ÿญ ์†๋„
  • 57. ์˜์กด์„ฑ๊ณผ ๊ต์ฐฉ ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ : ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰ ์ˆœ์„œ๊ฐ€ ์‹คํ–‰ ๊ฒฐ๊ณผ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๊ฒƒ DO k = 1, 100 F(k + 2) = F(k +1) + F(k) ENDDO ๊ต์ฐฉ : ๋‘˜ ์ด์ƒ์˜ ํ”„๋กœ์„ธ์Šค๋“ค์ด ์„œ๋กœ ์ƒ๋Œ€๋ฐฉ์˜ ์ด๋ฒคํŠธ ๋ฐœ์ƒ์„ ๊ธฐ๋‹ค๋ฆฌ๋Š” ์ƒํƒœ Process 1 X = 4 SOURCE = TASK2 RECEIVE (SOURCE,Y) DEST = TASK2 SEND (DEST,X) Z = X + Y Process 2 Y = 8 SOURCE = TASK1 RECEIVE (SOURCE,X) DEST = TASK1 SEND (DEST,Y) Z = X + Y
  • 58. ์˜์กด์„ฑ F(1) F(2) F(3) F(4) F(5) F(6) F(7) โ€ฆ F(n) 1 2 3 4 5 6 7 โ€ฆ n DO k = 1, 100 F(k + 2) = F(k +1) + F(k) ENDDO Serial F(1) F(2) F(3) F(4) F(5) F(6) F(7) โ€ฆ F(n) 1 2 3 5 8 13 21 โ€ฆ โ€ฆ F(1) F(2) F(3) F(4) F(5) F(6) F(7) โ€ฆ F(n) 1 2 3 5(4) 7 11 18 โ€ฆ โ€ฆ Parallel
  • 59. ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ์ž‘์„ฑ ์ˆœ์„œ โ‘  ์ˆœ์ฐจ์ฝ”๋“œ ์ž‘์„ฑ, ๋ถ„์„(ํ”„๋กœํŒŒ์ผ๋ง), ์ตœ์ ํ™” ๏‚ง ๏‚ง โ‘ก hotspot, ๋ณ‘๋ชฉ์ง€์ , ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ ๋“ฑ์„ ํ™•์ธ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ/ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ ? ๋ณ‘๋ ฌ์ฝ”๋“œ ๊ฐœ๋ฐœ ๏‚ง MPI/OpenMP/โ€ฆ ? ๏‚ง ํƒœ์Šคํฌ ํ• ๋‹น๊ณผ ์ œ์–ด, ํ†ต์‹ , ๋™๊ธฐํ™” ์ฝ”๋“œ ์ถ”๊ฐ€ โ‘ข ์ปดํŒŒ์ผ, ์‹คํ–‰, ๋””๋ฒ„๊น… โ‘ฃ ๋ณ‘๋ ฌ์ฝ”๋“œ ์ตœ์ ํ™” ๏‚ง ์„ฑ๋Šฅ์ธก์ •๊ณผ ๋ถ„์„์„ ํ†ตํ•œ ์„ฑ๋Šฅ๊ฐœ์„ 
  • 60. ๋””๋ฒ„๊น…๊ณผ ์„ฑ๋Šฅ๋ถ„์„ ๋””๋ฒ„๊น… ๏‚ง ์ฝ”๋“œ ์ž‘์„ฑ์‹œ ๋ชจ๋“ˆํ™” ์ ‘๊ทผ ํ•„์š” ๏‚ง ํ†ต์‹ , ๋™๊ธฐํ™”, ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ, ๊ต์ฐฉ ๋“ฑ์— ์ฃผ์˜ ๏‚ง ๋””๋ฒ„๊ฑฐ : TotalView ์„ฑ๋Šฅ์ธก์ •๊ณผ ๋ถ„์„ ๏‚ง timer ํ•จ์ˆ˜ ์‚ฌ์šฉ ๏‚ง ํ”„๋กœํŒŒ์ผ๋Ÿฌ : prof, gprof, pgprof, TAU
  • 62. I. Introduction to Parallel Computing
  • 63. OpenMP๋ž€ ๋ฌด์—‡์ธ๊ฐ€? ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ํ™˜๊ฒฝ์—์„œ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ์ž‘์„ฑ์„ ์œ„ํ•œ ์‘์šฉํ”„๋กœ๊ทธ๋žจ ์ธํ„ฐํŽ˜์ด์Šค(API)
  • 64. OpenMP์˜ ์—ญ์‚ฌ 1990๋…„๋Œ€ : ๏‚ง ๊ณ ์„ฑ๋Šฅ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๋ฐœ์ „ ๏‚ง ์—…์ฒด ๊ณ ์œ ์˜ ์ง€์‹œ์–ด ์ง‘ํ•ฉ ์‚ฌ์šฉ ๏ƒ  ํ‘œ์ค€ํ™”์˜ ํ•„์š”์„ฑ 1994๋…„ ANSI X3H5 ๏ƒ  1996๋…„ openmp.org ์„ค๋ฆฝ 1997๋…„ OpenMP API ๋ฐœํ‘œ Release History ๏‚ง OpenMP Fortran API ๋ฒ„์ „ 1.0 : 1997๋…„ 10์›” ๏‚ง C/C++ API ๋ฒ„์ „ 1.0 : 1998๋…„ 10์›” ๏‚ง Fortran API ๋ฒ„์ „ 1.1 : 1999๋…„ 11์›” ๏‚ง Fortran API ๋ฒ„์ „ 2.0 : 2000๋…„ 11์›” ๏‚ง C/C++ API ๋ฒ„์ „ 2.0 : 2002๋…„ 3์›” ๏‚ง Combined C/C++ and Fortran API ๋ฒ„์ „ 2.5 : 2005๋…„ 5์›” ๏‚ง API ๋ฒ„์ „ 3.0 : 2008๋…„ 5์›”
  • 65. OpenMP์˜ ๋ชฉํ‘œ ํ‘œ์ค€๊ณผ ์ด์‹์„ฑ ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ‘œ์ค€ ๋Œ€๋ถ€๋ถ„์˜ Unix์™€ Windows์— OpenMP ์ปดํŒŒ์ผ๋Ÿฌ ์กด์žฌ Fortran, C/C++ ์ง€์›
  • 67. OpenMP์˜ ๊ตฌ์„ฑ (2/2) ์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ๏‚ง ์Šค๋ ˆ๋“œ ์‚ฌ์ด์˜ ์ž‘์—…๋ถ„๋‹ด, ํ†ต์‹ , ๋™๊ธฐํ™”๋ฅผ ๋‹ด๋‹น ๏‚ง ์ข์€ ์˜๋ฏธ์˜ OpenMP ์˜ˆ) C$OMP PARALLEL DO ์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๏‚ง ๋ณ‘๋ ฌ ๋งค๊ฐœ๋ณ€์ˆ˜(์ฐธ์—ฌ ์Šค๋ ˆ๋“œ์˜ ๊ฐœ์ˆ˜, ๋ฒˆํ˜ธ ๋“ฑ)์˜ ์„ค์ •๊ณผ ์กฐํšŒ ์˜ˆ) CALL omp_set_num_threads(128) ํ™˜๊ฒฝ๋ณ€์ˆ˜ ๏‚ง ์‹คํ–‰ ์‹œ์Šคํ…œ์˜ ๋ณ‘๋ ฌ ๋งค๊ฐœ๋ณ€์ˆ˜(์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ๋“ฑ)๋ฅผ ์ •์˜ ์˜ˆ) export OMP_NUM_THREADS=8
  • 68. OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (1/4) ์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ๊ธฐ๋ฐ˜ ๏‚ง ์ˆœ์ฐจ์ฝ”๋“œ์˜ ์ ์ ˆํ•œ ์œ„์น˜์— ์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ์‚ฝ์ž… ๏‚ง ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ์ง€์‹œ์–ด๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์ฝ”๋“œ ์ƒ์„ฑ ๏‚ง OpenMP๋ฅผ ์ง€์›ํ•˜๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ํ•„์š” ๏‚ง ๋™๊ธฐํ™”, ์˜์กด์„ฑ ์ œ๊ฑฐ ๋“ฑ์˜ ์ž‘์—… ํ•„์š”
  • 69. OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (2/4) Fork-Join ๏‚ง ๏‚ง ๋ณ‘๋ ฌํ™”๊ฐ€ ํ•„์š”ํ•œ ๋ถ€๋ถ„์— ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์ƒ์„ฑ ๋ณ‘๋ ฌ๊ณ„์‚ฐ์„ ๋งˆ์น˜๋ฉด ๋‹ค์‹œ ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰ F J F J O O O O Master R I R I Thread K N K N [Parallel Region] [Parallel Region]
  • 70. OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (3/4) ์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ์‚ฝ์ž… Serial Code PROGRAM exam โ€ฆ ialpha = 2 DO i = 1, 100 a(i) = a(i) + ialpha*b(i) ENDDO PRINT *, a END Parallel Code PROGRAM exam โ€ฆ ialpha = 2 !$OMP PARALLEL DO DO i = 1, 100 a(i) = a(i) + ialpha*b(i) ENDDO !$OMP END PARALLEL DO PRINT *, a END
  • 71. OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (4/4) Fork-Join โ€ป export OMP_NUM_THREADS = 4 ialpha = 2 (Master Thread) (Fork) DO i=1,25 DO i=26,50 DO i=51,75 DO i=76,100 ... ... ... ... (Join) (Master) PRINT *, a (Slave) (Master Thread) (Slave) (Slave)
  • 72. OpenMP์˜ ์žฅ์ ๊ณผ ๋‹จ์  ์žฅ ์  ๏‚– MPI๋ณด๋‹ค ์ฝ”๋”ฉ, ๋””๋ฒ„๊น…์ด ์‰ฌ์›€ ๏‚– ๋ฐ์ดํ„ฐ ๋ถ„๋ฐฐ๊ฐ€ ์ˆ˜์›” ๋‹จ ์  โ€ข ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌํ™˜๊ฒฝ์˜ ๋‹ค์ค‘ ํ”„๋กœ์„ธ์„œ ์•„ํ‚คํ…์ฒ˜์—์„œ๋งŒ ๊ตฌํ˜„ ๊ฐ€๋Šฅ ๏‚– ์ ์ง„์  ๋ณ‘๋ ฌํ™”๊ฐ€ ๊ฐ€๋Šฅ โ€ข OpenMP๋ฅผ ์ง€์›ํ•˜๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ํ•„์š” ๏‚– ํ•˜๋‚˜์˜ ์ฝ”๋“œ๋ฅผ ๋ณ‘๋ ฌ์ฝ”๋“œ์™€ ์ˆœ์ฐจ์ฝ” โ€ข ๋ฃจํ”„์— ๋Œ€ํ•œ ์˜์กด๋„๊ฐ€ ํผ ๏ƒ  ๋‚ฎ์€ ๋“œ๋กœ ์ปดํŒŒ์ผ ๊ฐ€๋Šฅ ๏‚– ์ƒ๋Œ€์ ์œผ๋กœ ์ฝ”๋“œ ํฌ๊ธฐ๊ฐ€ ์ž‘์Œ ๋ณ‘๋ ฌํ™” ํšจ์œจ์„ฑ โ€ข ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์˜ ํ™•์žฅ์„ฑ (ํ”„๋กœ์„ธ์„œ ์ˆ˜, ๋ฉ”๋ชจ๋ฆฌ ๋“ฑ) ํ•œ๊ณ„
  • 73. OpenMP์˜ ์ „ํ˜•์  ์‚ฌ์šฉ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ์„ ์ด์šฉํ•œ ๋ฃจํ”„์˜ ๋ณ‘๋ ฌํ™” 1. ์‹œ๊ฐ„์ด ๋งŽ์ด ๊ฑธ๋ฆฌ๋Š” ๋ฃจํ”„๋ฅผ ์ฐพ์Œ (ํ”„๋กœํŒŒ์ผ๋ง) 2. ์˜์กด์„ฑ, ๋ฐ์ดํ„ฐ ์œ ํšจ๋ฒ”์œ„ ์กฐ์‚ฌ 3. ์ง€์‹œ์–ด ์‚ฝ์ž…์œผ๋กœ ๋ณ‘๋ ฌํ™” ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ์„ ์ด์šฉํ•œ ๋ณ‘๋ ฌํ™”๋„ ๊ฐ€๋Šฅ
  • 74. ์ง€์‹œ์–ด (1/5) OpenMP ์ง€์‹œ์–ด ๋ฌธ๋ฒ• Fortran (๊ณ ์ •ํ˜•์‹:f77) ์ง€์‹œ์–ด ์‹œ์ž‘ (๊ฐ์‹œ๋ฌธ์ž) ์ค„ ๋ฐ”๊ฟˆ ์„ ํƒ์  ์ปดํŒŒ์ผ ์‹œ์ž‘์œ„์น˜ Fortran (์ž์œ ํ˜•์‹:f90) C โ–ช !$OMP <์ง€์‹œ์–ด> โ–ช C$OMP <์ง€์‹œ์–ด> โ–ช !$OMP <์ง€์‹œ์–ด> โ–ช #pragma omp โ–ช !$OMP <์ง€์‹œ์–ด> & โ–ช #pragma omp โ€ฆ โ–ช *$OMP <์ง€์‹œ์–ด> โ–ช !$OMP <์ง€์‹œ์–ด> !$OMP& โ€ฆ โ€ฆ โ€ฆ โ–ช !$ โ€ฆ โ–ช C$ โ€ฆ โ–ช !$ โ€ฆ โ–ช #ifdef _OPENMP โ–ช *$ โ€ฆ ์ฒซ๋ฒˆ์งธ ์—ด ๋ฌด๊ด€ ๋ฌด๊ด€
  • 75. ์ง€์‹œ์–ด (2/5) ๋ณ‘๋ ฌ์˜์—ญ ์ง€์‹œ์–ด ๏‚ง ๏‚ง ๏‚ง PARALLEL/END PARALLEL ์ฝ”๋“œ๋ถ€๋ถ„์„ ๋ณ‘๋ ฌ์˜์—ญ์œผ๋กœ ์ง€์ • ์ง€์ •๋œ ์˜์—ญ์€ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ์—์„œ ๋™์‹œ์— ์‹คํ–‰๋จ ์ž‘์—…๋ถ„ํ•  ์ง€์‹œ์–ด ๏‚ง ๏‚ง ๏‚ง DO/FOR ๋ณ‘๋ ฌ์˜์—ญ ๋‚ด์—์„œ ์‚ฌ์šฉ ๋ฃจํ”„์ธ๋ฑ์Šค๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๊ฐ ์Šค๋ ˆ๋“œ์—๊ฒŒ ๋ฃจํ”„์ž‘์—… ํ• ๋‹น ๊ฒฐํ•ฉ๋œ ๋ณ‘๋ ฌ ์ž‘์—…๋ถ„ํ•  ์ง€์‹œ์–ด ๏‚ง ๏‚ง PARALLEL DO/FOR PARALLEL + DO/FOR์˜ ์—ญํ• ์„ ์ˆ˜ํ–‰
  • 76. ์ง€์‹œ์–ด (3/5) ๋ณ‘๋ ฌ์˜์—ญ ์ง€์ • Fortran !$OMP PARALLEL DO i = 1, 10 PRINT *, โ€žHello Worldโ€Ÿ, i ENDDO !$OMP END PARALLEL C #pragma omp parallel for(i=1; i<=10; i++) printf(โ€œHello World %dnโ€,i);
  • 77. ์ง€์‹œ์–ด (4/5) ๋ณ‘๋ ฌ์˜์—ญ๊ณผ ์ž‘์—…๋ถ„ํ•  Fortran C !$OMP PARALLEL #pragma omp parallel !$OMP DO DO i = 1, 10 PRINT *, โ€žHello Worldโ€Ÿ, i ENDDO [!$OMP END DO] !$OMP END PARALLEL { #pragma omp for for(i=1; i<=10; i++) printf(โ€œHello World %dnโ€,i); }
  • 78. ์ง€์‹œ์–ด (5/5) ๋ณ‘๋ ฌ์˜์—ญ๊ณผ ์ž‘์—…๋ถ„ํ•  Fortran !$OMP PARALLEL !$OMP DO DO i = 1, n a(i) = b(i) + c(i) ENDDO [!$OMP END DO] Optional !$OMP DO โ€ฆ [!$OMP END DO] !$OMP END PARALLEL C #pragma omp parallel { #pragma omp for for (i=1; i<=n; i++) { a[i] = b[i] + c[i] } #pragma omp for for(โ€ฆ){ โ€ฆ } }
  • 79. ์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ ํ™˜๊ฒฝ๋ณ€์ˆ˜ (1/3) ์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๏‚ง ๏‚ง ๏‚ง omp_set_num_threads(integer) : ์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ์ง€์ • omp_get_num_threads() : ์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ๋ฆฌํ„ด omp_get_thread_num() : ์Šค๋ ˆ๋“œ ID ๋ฆฌํ„ด ํ™˜๊ฒฝ๋ณ€์ˆ˜ ๏‚ง OMP_NUM_THREADS : ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์Šค๋ ˆ๋“œ ์ตœ๋Œ€ ๊ฐœ์ˆ˜ โ€ข export OMP_NUM_THREADS=16 (ksh) โ€ข setenv OMP_NUM_THREADS 16 (csh) C : #include <omp.h>
  • 80. ์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ ํ™˜๊ฒฝ๋ณ€์ˆ˜ (3/3) omp_set_num_threads omp_get_thread_num INTEGER OMP_GET_THREAD_NUM CALL OMP_SET_NUM_THREADS(4) Fortran !$OMP PARALLEL PRINT*, โ€ฒThread rank: โ€ฒ, OMP_GET_THREAD_NUM() !$OMP END PARALLEL #include <omp.h> omp_set_num_threads(4); C #pragma omp parallel { printf(โ€ณThread rank:%d๏ผผnโ€ณ,omp_get_thread_num()); }
  • 81. ์ฃผ์š” Clauses private(var1, var2, โ€ฆ) shared(var1, var2, โ€ฆ) default(shared|private|none) firstprivate(var1, var2, โ€ฆ) lastprivate(var1, var2, โ€ฆ) reduction(operator|intrinsic:var1, var2,โ€ฆ) schedule(type [,chunk])
  • 82. clause : reduction (1/4) reduction(operator|intrinsic:var1, var2,โ€ฆ) ๏‚ง reduction ๋ณ€์ˆ˜๋Š” shared โ€ข ๋ฐฐ์—ด ๊ฐ€๋Šฅ(Fortran only): deferred shape, assumed shape array ์‚ฌ ์šฉ ๋ถˆ๊ฐ€ โ€ข C๋Š” scalar ๋ณ€์ˆ˜๋งŒ ๊ฐ€๋Šฅ ๏‚ง ๊ฐ ์Šค๋ ˆ๋“œ์— ๋ณต์ œ๋ผ ์—ฐ์‚ฐ์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๊ฐ’์œผ๋กœ ์ดˆ๊ธฐํ™”๋˜๊ณ (ํ‘œ ์ฐธ์กฐ) ๋ณ‘๋ ฌ ์—ฐ์‚ฐ ์ˆ˜ํ–‰ ๏‚ง ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ์—์„œ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰๋œ ๊ณ„์‚ฐ๊ฒฐ๊ณผ๋ฅผ ํ™˜์‚ฐํ•ด ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๋งˆ์Šคํ„ฐ ์Šค๋ ˆ๋“œ๋กœ ๋‚ด ๋†“ ์Œ
  • 83. clause : reduction (2/4) !$OMP DO reduction(+:sum) DO i = 1, 100 sum = sum + x(i) ENDDO Thread 0 Thread 1 sum0 = 0 sum1 = 0 DO i = 1, 50 DO i = 51, 100 sum0 = sum0 + x(i) ENDDO sum = sum0 + sum1 sum1 = sum1 + x(i) ENDDO
  • 84. clause : reduction (3/4) Reduction Operators : Fortran Operator Data Types ์ดˆ๊ธฐ๊ฐ’ + integer, floating point (complex or real) 0 * integer, floating point (complex or real) 1 - integer, floating point (complex or real) 0 .AND. logical .TRUE. .OR. logical .FALSE. .EQV. logical .TRUE. .NEQV. logical .FALSE. MAX integer, floating point (real only) ๊ฐ€๋Šฅํ•œ ์ตœ์†Œ๊ฐ’ MIN integer, floating point (real only) ๊ฐ€๋Šฅํ•œ ์ตœ๋Œ€๊ฐ’ IAND integer all bits on IOR integer 0 IEOR integer 0
  • 85. clause : reduction (4/4) Reduction Operators : C Operator Data Types ์ดˆ๊ธฐ๊ฐ’ + integer, floating point 0 * integer, floating point 1 - integer, floating point 0 & integer all bits on | integer 0 ^ integer 0 && integer 1 || integer 0
  • 88. Current HPC Platforms : COTS-Based Clusters COTS = Commercial off-the-shelf Nehalem Access Control File Server(s) Gulftown โ€ฆ Login Node(s) 88 Compute Nodes
  • 89. Memory Architectures Shared Memory ๏‚ง Single address space for all processors <NUMA> <UMA> Distributed Memory 89
  • 90. What is MPI? MPI = Message Passing Interface MPI is a specification for the developers and users of message passing libraries. By itself, it is NOT a library โ€“ but rather the specification of what such a library should be. MPI primarily addresses the message-passing parallel programming model : data is moved from the address space of one process to that of another process through cooperative operations on each process. Simply stated, the goal of the message Passing Interface is to provide a widely used standard for writing message passing programs. The interface attempts to be : ๏‚ง ๏‚ง Portable ๏‚ง Efficient ๏‚ง 90 Practical Flexible
  • 91. What is MPI? The MPI standard has gone through a number of revisions, with the most recent version being MPI-3. Interface specifications have been defined for C and Fortran90 language bindings : ๏‚ง C++ bindings from MPI-1 are removed in MPI-3 ๏‚ง MPI-3 also provides support for Fortran 2003 and 2008 features Actual MPI library implementations differ in which version and features of the MPI standard they support. Developers/users will need to be aware of this. 91
  • 92. Programming Model Originally, MPI was designed for distributed memory architectures, which were becoming increasingly popular at time (1980s โ€“ early 1990s). As architecture trends changed, shared memory SMPs were combined over networks creating hybrid distributed memory/shared memory systems. 92
  • 93. Programming Model MPI implementers adapted their libraries to handle both types of underlying memory architectures seamlessly. They also adapted/developed ways of handing different interconnects and protocols. Today, MPI runs on virtually any hardware platform : ๏‚ง Distributed Memory ๏‚ง Shared Memory ๏‚ง Hybrid The programming model clearly remains a distributed memory model however, regardless of the underlying physical architecture of the machine. 93
  • 94. Reasons for Using MPI Standardization ๏‚ง MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries. Portability ๏‚ง There is little or no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard. Performance Opportunities ๏‚ง Vendor implementations should be able to exploit native hardware features to optimize performance. Functionality ๏‚ง There are over 440 routines defined in MPI-3, which includes the majority of those in MPI-2 and MPI-1. Availability ๏‚ง 94 A Variety of implementations are available, both vendor and public domain.
  • 95. History and Evolution MPI has resulted from the efforts of numerous individuals and groups that began in 1992. 1980s โ€“ early 1990s : Distributed memory, parallel computing develops, as do a number of incompatible soft ware tools for writing such programs โ€“ usually with tradeoffs between portability, performance, functionality and price. Recognition of the need for a standard arose. Apr 1992 : Workshop on Standards for Message Passing in a Distributed Memory Environment, Sponsored by the Center for Research on Parallel Computing, Williamsburg, Virginia. The basic features essential to a standard message passing interface were discussed, and a working group established to continue the standardization process. Preliminary draft proposal developed subsequently. 95
  • 96. History and Evolution Nov 1992 : Working group meets in Minneapolis. MPI draft proposal (MPI1) from ORNL presented. Group adopts procedures and organization to form the MPI Forum. It eventually comprised of about 175 individuals from 40 organizations including parallel computer vendors, software writers, academia and application scientists. Nov 1993 : Supercomputing 93 conference โ€“ draft MPI standard presented. May 1994 : Final version of MPI-1.0 released. MPI-1.0 was followed by versions MPI-1.1 (Jun 1995), MPI-1.2 (Jul 1997) and MPI-1.3 (May 2008). MPI-2 picked up where the first MPI specification left off, and addressed topics which went far beyond the MPI-1 specification. Was finalized in 1996. MPI-2.1 (Sep 2009), and MPI-2.2 (Sep 2009) followed. Sep 2012 : The MPI-3.0 standard was approved. 96
  • 97. History and Evolution Documentation for all versions of the MPI standard is available at : ๏‚ง 97 http://www.mpi-forum.org/docs/
  • 98. A General Structure of the MPI Program 98
  • 99. A Header File for MPI routines Required for all programs that make MPI library calls. C include file Fortran include file #include โ€œmpi.hโ€ include โ€žmpif.hโ€Ÿ With MPI-3 Fortran, the USE mpi_f80 module is preferred over using the include file shown above. 99
  • 100. The Format of MPI Calls C names are case sensitive; Fortran name are not. Programs must not declare variables or functions with names beginning with the prefix MPI_ or PMPI_ (profiling interface). C Binding Format rc = MPI_Xxxxx(parameter, โ€ฆ) Example rc = MPI_Bsend(&buf, count, type, dest, tag, comm) Error code Returned as โ€œrcโ€, MPI_SUCCESS if successful. Fortran Binding Format Example call MPI_BSEND(buf, count, type, dest, tag, comm, ierr) Error code 100 CALL MPI_XXXXX(parameter, โ€ฆ, ierr) call mpi_xxxxx(parameter, โ€ฆ, ierr) Returned as โ€œierrโ€ parameter, MPI_SUCCESS if successful.
  • 101. Communicators and Groups MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Most MPI routines require you to specify a communicator as an argument. Communicators and groups will be covered in more detail later. For now, simply use MPI_COMM_WORLD whenever a communicator is required - it is the predefined communicator that includes all of your MPI processes. 101
  • 102. Rank Within a communicator, every process has its own unique, integer identifier assigned by the system when the process initializes. A rank is sometimes also called a โ€œtask IDโ€. Ranks are contiguous and begin at zero. Used by the programmer to specify the source and destination of messages. Often used conditionally by the application to control program execution (if rank = 0 do this / if rank = 1 do that). 102
  • 103. Error Handling Most MPI routines include a return/error code parameter, as described in โ€œFormat of MPI Callsโ€ section above. However, according to the MPI standard, the default behavior of an MPI call is to abort if there is an error. This means you will probably not be able to capture a return/error code other than MPI_SUCCESS (zero). The standard does provide a means to override this default error handler. You can also consult the error handing section of the MPI Standard located at http://www.mpiforum.org/docs/mpi-11-html/node148.html . The types of errors displayed to the user are implementation dependent. 103
  • 104. Environment Management Routines MPI_Init ๏‚ง Initializes the MPI execution environment. This function must be called is every MPI program, must be called before any other MPI functions and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent. C MPI_Init(&argc, &argv) ๏‚ง ๏‚ง 104 Fortran MPI_INIT(ierr) Input parameters โ€ข argc : Pointer to the number of arguments โ€ข argv : Pointer to the argument vector ierr : the error return argument
  • 105. Environment Management Routines MPI_Comm_size ๏‚ง Returns the total number of MPI processes in the specified communicator, such as MPI_COMM_WORLD. If the communicator is MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application. C MPI_Comm_size(comm, &size) ๏‚ง ๏‚ง ๏‚ง 105 Fortran MPI_COMM_SIZE(comm, size, ierr) Input parameters โ€ข comm : communicator (handle) Output parameters โ€ข size : number of processes in the group of comm (integer) ierr : the error return argument
  • 106. Environment Management Routines MPI_Comm_rank ๏‚ง Returns the rank of the calling MPI process within the specified communicator. Initially, each process will be assigned a unique integer rank between 0 and number of tasks -1 within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well. C MPI_Comm_rank(comm, &rank) ๏‚ง ๏‚ง ๏‚ง 106 Fortran MPI_COMM_SIZE(comm, rank, ierr) Input parameters โ€ข comm : communicator (handle) Output parameters โ€ข rank : rank of the calling process in the group of comm (integer) ierr : the error return argument
  • 107. Environment Management Routines MPI_Finalize ๏‚ง Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program โ€“ no other MPI routines may be called after it. C MPI_Finalize() ๏‚ง 107 ierr : the error return argument Fortran MPI_FINALIZE(ierr)
  • 108. Environment Management Routines MPI_Abort ๏‚ง Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes regardless of the communicator specified. C MPI_Abort(comm, errorcode) ๏‚ง ๏‚ง 108 Fortran MPI_ABORT(comm, errorcode, ierr) Input parameters โ€ข comm : communicator (handle) โ€ข errorcode : error code to return to invoking environment ierr : the error return argument
  • 109. Environment Management Routines MPI_Get_processor_name ๏‚ง Return the processor name. Also returns the length of the name. The buffer for โ€œnameโ€ must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into โ€œnameโ€ is implementation dependent โ€“ may not be the same as the output of the โ€œhostnameโ€ or โ€œhostโ€ shell commands. C Fortran MPI_Get_processor_name(&name, &resultlength) MPI_GET_PROCESSOR_NAME(n ame, resultlength, ierr) ๏‚ง ๏‚ง 109 Output parameters โ€ข name : A unique specifies for the actual (as opposed to virtual) node. This must be an array of size at least MPI_MAX_PROCESOR_NAME . โ€ข resultlen : Length (in characters) of the name. ierr : the error return argument
  • 110. Environment Management Routines MPI_Get_version ๏‚ง Returns the version (either 1 or 2) and subversion of MPI. C MPI_Get_version(&version, &subversion) ๏‚ง ๏‚ง 110 Fortran MPI_GET_VERSION(version, subversion, ierr) Output parameters โ€ข version : Major version of MPI (1 or 2) โ€ข subversion : Miner version of MPI. ierr : the error return argument
  • 111. Environment Management Routines MPI_Initialized ๏‚ง Indicates whether MPI_Init has been called โ€“ returns flag as either logical true(1) or false(0). C MPI_Initialized(&flag) ๏‚ง ๏‚ง 111 Fortran MPI_INITIALIZED(flag, ierr) Output parameters โ€ข flag : Flag is true if MPI_Init has been called and false otherwise. ierr : the error return argument
  • 112. Environment Management Routines MPI_Wtime ๏‚ง Returns an elapsed wall clock time in seconds (double precision) on the calling processor. C MPI_Wtime() ๏‚ง Fortran MPI_WTIME() Return value โ€ข Time in seconds since an arbitrary time in the past. MPI_Wtick ๏‚ง Returns the resolution in seconds (double precision) of MPI_Wtime. C MPI_Wtick() ๏‚ง 112 Fortran MPI_WTICK() Return value โ€ข Time in seconds of the resolution MPI_Wtime.
  • 113. Example: Hello world #include<stdio.h> #include"mpi.h" int main(int argc, char *argv[]) { int rc; rc = MPI_Init(&argc, &argv); printf("Hello world.n"); rc = MPI_Finalize(); return 0; } 113
  • 114. Example: Hello world Execute a mpi program. $ module load [compiler] [mpi] $ mpicc hello.c $ mpirun โ€“np 4 โ€“hostfile [hostfile] ./a.out Make out a hostfile. ibs0001 ibs0002 ibs0003 ibs0003 โ€ฆ 114 slots=2 slots=2 slots=2 slots=2
  • 115. Example : Environment Management Routine #include "mpi.hโ€ #include <stdio.h> int main(argc,argv) int argc; char *argv[]; { int numtasks, rank, len, rc; char hostname[MPI_MAX_PROCESSOR_NAME]; rc = MPI_Init(&argc,&argv); if (rc != MPI_SUCCESS) { printf ("Error starting MPI program. Terminating.n"); MPI_Abort(MPI_COMM_WORLD, rc); } MPI_Comm_size(MPI_COMM_WORLD,&numtasks); MPI_Comm_rank(MPI_COMM_WORLD,&rank); MPI_Get_processor_name(hostname, &len); printf ("Number of tasks= %d My rank= %d Running on %sn", numtasks,rank,hostname); /******* do some work *******/ rc = MPI_Finalize(); return 0; } 115
  • 116. Types of Point-to-Point Operations MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks. One task is performing a send operation and the other task is performing a matching receive operation. There are different types of send and receive routines used for different purposes. ๏‚ง Synchronous send ๏‚ง Blocking send/blocking receive ๏‚ง Non-blocking send/non-blocking receive ๏‚ง Buffered send ๏‚ง Combined send/receive ๏‚ง โ€œReadyโ€ send Any type of send routine can be paired with any type of receive routine. MPI also provides several routines associated with send โ€“ receive operations, such as those used to wait for a messageโ€™s arrival or prove to find out if a message has arrived. 116
  • 117. Buffering In a perfect world, every send operation would be perfectly synchronized with its matching re ceive. This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync. Consider the following two cases ๏‚ง ๏‚ง 117 A send operation occurs 5 seconds before the receive is ready โ€“ where is the message w hile the receive is pending? Multiple sends arrive at the same receiving task which can only accept one send at a tim e โ€“ what happens to the messages that are โ€œbacking upโ€?
  • 118. Buffering The MPI implementation (not the MPI standard) decides what happens to data in these types of cases. Typically, a system buffer area is reserved to hold data in transit. 118
  • 119. Buffering System buffer space is : ๏‚ง ๏‚ง ๏‚ง ๏‚ง ๏‚ง 119 Opaque to the programmer and managed entirely by the MPI library A finite resource that can be easy to exhaust Often mysterious and not well documented Able to exist on the sending side, the receiving side, or both Something that may improve program performance because it allows send โ€“ receive ope rations to be asynchronous.
  • 120. Blocking vs. Non-blocking Most of the MPI point-to-point routines can be used in either blocking or non-blocking mode. Blocking ๏‚ง ๏‚ง ๏‚ง ๏‚ง A blocking send routine will only โ€œreturnโ€ after it is safe to modify the application buffer (your send data) for reuse. Safe means that modifications will not affect the data intended for the rec eive task. Safe dose not imply that the data was actually received โ€“ it may very well be sitting i n a system buffer. A blocking send can be synchronous which means there is handshaking occurring with the re ceive task to confirm a safe send. A blocking send can be asynchronous if a system buffer is used to hold the data for eventual d elivery to the receive. A blocking receive only โ€œreturnsโ€ after the data has arrived and is ready for use by the progra m. Non-blocking ๏‚ง ๏‚ง ๏‚ง ๏‚ง 120 Non-blocking send and receive routines behave similarly โ€“ they will return almost immediately. They do not wait for any communication events to complete, such as message copying from u ser memory to system buffer space or the actual arrival of message. Non-blocking operations simply โ€œrequestโ€ the MPI library to perform the operation when it is a ble. The user can not predict when it is able. The user can not predict when that will happen. It is unsafe to modify the application buffer (your variable space) until you know for a fact the r equested non-blocking operation was actually performed by the library. There are โ€œwaitโ€ routin es used to do this. Non-blocking communications are primarily used to overlap computation with communication and exploit possibale performance gains.
  • 121. MPI Message Passing Routine Arguments MPI point-to-point communication routines generally have an argument list that takes one of t he following formats : Blocking sends MPI_Send(buffer, count, type, dest, tag, comm) Non-blocking sends MPI_Isend(buffer, count, type, dest, tag, comm, request) Blocking receive MPI_Recv(buffer, count, type, source, tag, comm, status) Non-blocking receive MPI_Irecv(buffer, count, type, source, tag, comm, request) Buffer ๏‚ง Program (application) address space that references the data that is to be sent or receiv ed. In most cases, this is simply the variable name that is be sent/received. For C progra ms, this argument is passed by reference and usually must be prepended with an amper sand : &var1 Data count ๏‚ง 121 Indicates the number of data elements of a particular type to be sent.
  • 122. MPI Message Passing Routine Arguments Data type ๏‚ง For reasons of portability, MPI predefines its elementary data types. The table below lists those required by the standard. C Data Types MPI_CHAR MPI_SHORT signed short int MPI_INT signed int MPI_LONG signed long int MPI_SIGNED_CHAR signed char MPI_UNSIGNED_CHAR unsigned char MPI_UNSIGNED_SHORT unsigned short int MPI_UNSIGNED unsigned int MPI_UNSIGNED_LONG unsigned long int MPI_FLOAT float MPI_DOUBLE double MPI_LONG_DOUBLE 122 signed char long double
  • 123. MPI Message Passing Routine Arguments Destination ๏‚ง An argument to send routines that indicates the process where a message should be del ivered. Specified as the rank of the receiving process. Tag ๏‚ง Arbitrary non-negative integer assigned by the programmer to uniquely identify a messa ge. Send and receive operations should match message tags. For a receive operation, th e wild card MPI_ANY_TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers 0 โ€“ 32767 can be used as tags, but most impleme ntations allow a much larger range than this. Communicator ๏‚ง 123 Indicates the communication context, or set of processes for which the source or destin ation fields are valid. Unless the programmer is explicitly creating new communicator, th e predefined communicator MPI_COMM_WORLD is usually used.
  • 124. MPI Message Passing Routine Arguments Status ๏‚ง ๏‚ง ๏‚ง ๏‚ง For a receive operation, indicates the source of the message and the tag of the message. In C, this argument is a pointer to predefined structure MPI_Status (ex. stat.MPI_SOURC E, stat.MPI_TAG). In Fortran, it is an integer array of size MPI_STATUS_SIZE (ex. stat(MPI_SOURCE), stat(M PI_TAG)). Additionally, the actual number of bytes received are obtainable from Status via MPI_Get _out routine. Request ๏‚ง ๏‚ง ๏‚ง ๏‚ง ๏‚ง 124 Used by non-blocking send and receive operations. Since non-blocking operations may return before the requested system buffer space is o btained, the system issues a unique โ€œrequest numberโ€. The programmer uses this system assigned โ€œhandleโ€ later (in a WAIT type routine) to det ermine completion of the non-blocking operation. In C, this argument is pointer to predefined structure MPI_Request. In Fortran, it is an integer.
• 125. Example : Blocking Message Passing Routine (1/2)
#include "mpi.h"
#include <stdio.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, dest, source, rc, count, tag=1;
    char inmsg, outmsg='x';
    MPI_Status Stat;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1; source = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = 0; source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }
• 126. Example : Blocking Message Passing Routine (2/2)
    rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
    printf("Task %d: Received %d char(s) from task %d with tag %d\n",
           rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);

    MPI_Finalize();
    return 0;
}
• 127. Example : Dead Lock
#include "mpi.h"
#include <stdio.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, dest, source, rc, count, tag=1;
    char inmsg, outmsg='x';
    MPI_Status Stat;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* both ranks send first: each MPI_Send may block waiting for a matching
       receive that is never posted, so the program can deadlock */
    if (rank == 0) {
        dest = 1; source = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = 0; source = 0;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
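One possible fix for the deadlock above (an editor's sketch, not from the original slides) is to replace each rank's back-to-back send/receive with a single MPI_Sendrecv, which the library can always order safely. The variables are those declared in the example above.

    if (rank == 0) {
        dest = 1; source = 1;
        rc = MPI_Sendrecv(&outmsg, 1, MPI_CHAR, dest, tag,
                          &inmsg,  1, MPI_CHAR, source, tag,
                          MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = 0; source = 0;
        rc = MPI_Sendrecv(&outmsg, 1, MPI_CHAR, dest, tag,
                          &inmsg,  1, MPI_CHAR, source, tag,
                          MPI_COMM_WORLD, &Stat);
    }
    /* swapping the send/receive order on one of the two ranks, as in the
       example on slide 125, also removes the deadlock */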
• 128. Example : Non-Blocking Message Passing Routine (1/2)
Nearest neighbor exchange in a ring topology

#include "mpi.h"
#include <stdio.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2;
    MPI_Request reqs[4];
    MPI_Status stats[4];   /* one status per request passed to MPI_Waitall */

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    prev = rank-1;
    next = rank+1;
    if (rank == 0) prev = numtasks - 1;
    if (rank == (numtasks - 1)) next = 0;
• 129. Example : Non-Blocking Message Passing Routine (2/2)
    MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);

    /* do some work */

    MPI_Waitall(4, reqs, stats);

    MPI_Finalize();
    return 0;
}
• 130. Advanced Example : Monte-Carlo Simulation
<Problem>
๏‚ง Monte-Carlo simulation using random numbers
๏‚ง PI = 4 x Ac/As, where Ac is the area of a quarter circle of radius r and As is the area of the enclosing square of side r
<Requirement>
๏‚ง Use N processes (ranks)
๏‚ง Use point-to-point communication
• 131. Advanced Example : Monte-Carlo Simulation for PI
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0; cnt = 0; r = 0.0;

    for (i=0; i<num_step; i++) {
        x = rand() / (RAND_MAX+1.0);
        y = rand() / (RAND_MAX+1.0);
        r = sqrt(x*x + y*y);
        if (r<=1) cnt += 1;
    }

    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
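A possible point-to-point parallelization of the serial code above (an editor's sketch meeting the stated requirement, not the official solution): the strided loop decomposition, the simple per-rank srand seeding, and the use of MPI_LONG for the hit counts are all assumptions of this sketch.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i, cnt = 0, total, tmp;
    int rank, nprocs, p;
    double x, y, pi;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(rank + 1);                       /* crude per-rank stream, for illustration only */
    for (i = rank; i < num_step; i += nprocs) {
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        if (x * x + y * y <= 1.0) cnt++;   /* hit inside the quarter circle */
    }

    if (rank != 0) {
        MPI_Send(&cnt, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    } else {
        total = cnt;
        for (p = 1; p < nprocs; p++) {     /* rank 0 collects every partial count */
            MPI_Recv(&tmp, 1, MPI_LONG, p, 0, MPI_COMM_WORLD, &stat);
            total += tmp;
        }
        pi = 4.0 * (double)total / (double)num_step;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}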
• 132. Advanced Example : Numerical integration for PI
<Problem>
๏‚ง Get PI using numerical integration (midpoint rule) :
    \int_0^1 \frac{4.0}{1+x^2}\,dx = \pi \approx \sum_{i=1}^{n} \frac{4}{1+\left(\frac{i-0.5}{n}\right)^2} \cdot \frac{1}{n}
  [Figure: the integrand evaluated at the midpoints x_1 = (1-0.5)/n, x_2 = (2-0.5)/n, ..., x_n = (n-0.5)/n, each rectangle of width 1/n]
<Requirement>
๏‚ง Point-to-point communication
• 133. Advanced Example : Numerical integration for PI
#include <stdio.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0/(double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");

    for (i=0; i<num_step; i++) {
        x = ((double)i + 0.5) * step;   /* midpoint of the i-th subinterval,
                                           matching (i-0.5)/n for i = 1..n */
        sum += 4.0/(1.0+x*x);
    }

    pi = step * sum;
    printf("PI = %5lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
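A possible point-to-point parallelization of the integration code (an editor's sketch, not the official solution): each rank sums a strided subset of the midpoints and sends its partial sum to rank 0; the strided decomposition is an assumption of this sketch.

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i;
    int rank, nprocs, p;
    double x, sum = 0.0, tmp, pi, step;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    step = 1.0 / (double)num_step;
    for (i = rank; i < num_step; i += nprocs) {
        x = ((double)i + 0.5) * step;      /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }

    if (rank != 0) {
        MPI_Send(&sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        for (p = 1; p < nprocs; p++) {     /* accumulate the other partial sums */
            MPI_Recv(&tmp, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, &stat);
            sum += tmp;
        }
        pi = step * sum;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}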
• 134. Type of Collective Operations
Synchronization
๏‚ง Processes wait until all members of the group have reached the synchronization point.
Data Movement
๏‚ง broadcast, scatter/gather, all to all.
Collective Computation (reductions)
๏‚ง One member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
• 135. Programming Considerations and Restrictions
With MPI-3, collective operations can be blocking or non-blocking. Only blocking operations are covered in this tutorial.
Collective communication routines do not take message tag arguments.
Collective operations within subsets of processes are accomplished by first partitioning the subsets into new groups and then attaching the new groups to new communicators.
Can only be used with MPI predefined datatypes – not with MPI Derived Data Types.
MPI-2 extended most collective operations to allow data movement between intercommunicators (not covered here).
• 136. Collective Communication Routines
MPI_Barrier
๏‚ง Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.
C       : MPI_Barrier(comm)
Fortran : MPI_BARRIER(comm, ierr)
• 137. Collective Communication Routines
MPI_Bcast
๏‚ง Data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
C       : MPI_Bcast(&buffer, count, datatype, root, comm)
Fortran : MPI_BCAST(buffer, count, datatype, root, comm, ierr)
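A minimal usage sketch (added for illustration; the variable n and the value 100 are placeholders): only the root knows the value before the call, and every rank holds it afterwards.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) n = 100;               /* only the root sets the value */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d now has n = %d\n", rank, n);   /* 100 on every rank */
    MPI_Finalize();
    return 0;
}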
• 138. Collective Communication Routines
MPI_Scatter
๏‚ง Data movement operation. Distributes distinct messages from a single source task to each task in the group.
C       : MPI_Scatter(&sendbuf, sendcnt, sendtype, &recvbuf, recvcnt, recvtype, root, comm)
Fortran : MPI_SCATTER(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm, ierr)
• 139. Collective Communication Routines
MPI_Gather
๏‚ง Data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.
C       : MPI_Gather(&sendbuf, sendcnt, sendtype, &recvbuf, recvcount, recvtype, root, comm)
Fortran : MPI_GATHER(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm, ierr)
• 140. Collective Communication Routines
MPI_Allgather
๏‚ง Data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
C       : MPI_Allgather(&sendbuf, sendcount, sendtype, &recvbuf, recvcount, recvtype, comm)
Fortran : MPI_ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
• 141. Collective Communication Routines
MPI_Reduce
๏‚ง Collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.
C       : MPI_Reduce(&sendbuf, &recvbuf, count, datatype, op, root, comm)
Fortran : MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
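A minimal usage sketch (added for illustration; the per-rank value rank+1 is a placeholder): each rank contributes one number and the sum appears on the root only.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    long local, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local = rank + 1;                      /* each rank's contribution */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)                         /* the result is defined only on the root */
        printf("sum over %d ranks = %ld\n", nprocs, total);
    MPI_Finalize();
    return 0;
}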
• 142. Collective Communication Routines
The predefined MPI reduction operations appear below. Users can also define their own reduction functions by using the MPI_Op_create routine.
MPI Reduction Operation   Operation                 C Data Types
MPI_MAX                   maximum                   integer, float
MPI_MIN                   minimum                   integer, float
MPI_SUM                   sum                       integer, float
MPI_PROD                  product                   integer, float
MPI_LAND                  logical AND               integer
MPI_BAND                  bit-wise AND              integer, MPI_BYTE
MPI_LOR                   logical OR                integer
MPI_BOR                   bit-wise OR               integer, MPI_BYTE
MPI_LXOR                  logical XOR               integer
MPI_BXOR                  bit-wise XOR              integer, MPI_BYTE
MPI_MAXLOC                max value and location    float, double and long double
MPI_MINLOC                min value and location    float, double and long double
• 143. Collective Communication Routines
MPI_Allreduce
๏‚ง Collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
C       : MPI_Allreduce(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran : MPI_ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm, ierr)
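A small sketch (added for illustration; the per-rank value rank+1 is a placeholder) showing the equivalence stated above: both paths leave the same total on every rank.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    double local, via_allreduce, via_reduce_bcast = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    local = (double)(rank + 1);

    /* one call ... */
    MPI_Allreduce(&local, &via_allreduce, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* ... versus the equivalent two-step version */
    MPI_Reduce(&local, &via_reduce_bcast, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&via_reduce_bcast, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d: %f %f\n", rank, via_allreduce, via_reduce_bcast);
    MPI_Finalize();
    return 0;
}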
• 144. Collective Communication Routines
MPI_Reduce_scatter
๏‚ง Collective computation operation + data movement. First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.
C       : MPI_Reduce_scatter(&sendbuf, &recvbuf, recvcount, datatype, op, comm)
Fortran : MPI_REDUCE_SCATTER(sendbuf, recvbuf, recvcount, datatype, op, comm, ierr)
• 145. Collective Communication Routines
MPI_Alltoall
๏‚ง Data movement operation. Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.
C       : MPI_Alltoall(&sendbuf, sendcount, sendtype, &recvbuf, recvcnt, recvtype, comm)
Fortran : MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm, ierr)
• 146. Collective Communication Routines
MPI_Scan
๏‚ง Performs a scan operation with respect to a reduction operation across a task group.
C       : MPI_Scan(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran : MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm, ierr)
• 147. Collective Communication Routines
[Figure: data layouts before and after each collective across processes P0–P3, where * denotes the reduction operator —
 broadcast : the root's A is copied to every process;
 scatter / gather : A, B, C, D are distributed one per process / collected back on the root;
 allgather : every process ends up with A, B, C, D;
 alltoall : process Pi sends its j-th element to Pj (the data matrix is transposed);
 reduce : A*B*C*D on the root only; allreduce : A*B*C*D on every process;
 scan : Pi receives the prefix A*...*(its own element);
 reduce_scatter : element-wise reduction whose result is then scattered across the processes]
• 148. Example : Collective Communication (1/2)
Perform a scatter operation on the rows of an array

#include "mpi.h"
#include <stdio.h>
#define SIZE 4

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, sendcount, recvcount, source;
    float sendbuf[SIZE][SIZE] = {
        { 1.0,  2.0,  3.0,  4.0},
        { 5.0,  6.0,  7.0,  8.0},
        { 9.0, 10.0, 11.0, 12.0},
        {13.0, 14.0, 15.0, 16.0} };
    float recvbuf[SIZE];

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
• 149. Example : Collective Communication (2/2)
    if (numtasks == SIZE) {
        source = 1;
        sendcount = SIZE;
        recvcount = SIZE;
        MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf, recvcount,
                    MPI_FLOAT, source, MPI_COMM_WORLD);
        printf("rank= %d Results: %f %f %f %f\n", rank, recvbuf[0],
               recvbuf[1], recvbuf[2], recvbuf[3]);
    }
    else
        printf("Must specify %d processors. Terminating.\n", SIZE);

    MPI_Finalize();
    return 0;
}
• 150. Advanced Example : Monte-Carlo Simulation for PI
Use the collective communication routines!

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0; cnt = 0; r = 0.0;

    for (i=0; i<num_step; i++) {
        x = rand() / (RAND_MAX+1.0);
        y = rand() / (RAND_MAX+1.0);
        r = sqrt(x*x + y*y);
        if (r<=1) cnt += 1;
    }

    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
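One possible collective solution (an editor's sketch, not the official answer): the per-rank counting is the same as in the point-to-point sketch on slide 131, but the whole send/receive exchange collapses into a single MPI_Reduce. The strided decomposition and per-rank seeding remain assumptions of the sketch.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i, cnt = 0, total = 0;
    int rank, nprocs;
    double x, y, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(rank + 1);                                  /* crude per-rank stream */
    for (i = rank; i < num_step; i += nprocs) {
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        if (x * x + y * y <= 1.0) cnt++;
    }

    /* one reduction replaces the explicit point-to-point collection */
    MPI_Reduce(&cnt, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = 4.0 * (double)total / (double)num_step;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}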
• 151. Advanced Example : Numerical integration for PI
Use the collective communication routines!

#include <stdio.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0/(double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");

    for (i=0; i<num_step; i++) {
        x = ((double)i + 0.5) * step;   /* midpoint of the i-th subinterval */
        sum += 4.0/(1.0+x*x);
    }

    pi = step * sum;
    printf("PI = %5lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
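One possible collective solution for this exercise as well (an editor's sketch, not the official answer): each rank sums its strided share of the midpoints and a single MPI_Reduce combines the partial sums on rank 0.

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i;
    int rank, nprocs;
    double x, sum = 0.0, total = 0.0, pi, step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    step = 1.0 / (double)num_step;
    for (i = rank; i < num_step; i += nprocs) {
        x = ((double)i + 0.5) * step;                 /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }

    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = step * total;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}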