์Šˆํผ์ปดํ“จํŒ… ๊ต์œก - UNIST

Parallel Programming

CONTENTS
I. Introduction to Parallel Computing
II. Parallel Programming using OpenMP
III. Parallel Programming using MPI

I. Introduction to Parallel Computing
Parallel Processing (1/3)

Parallel processing divides a computation that would otherwise run sequentially into several parts and executes those parts simultaneously on multiple processors.
Parallel Processing (2/3)

[Figure: the same inputs produce the same outputs whether the work is executed sequentially on one processor or in parallel on several.]
Parallel Processing (3/3)

Main goal: solve larger problems faster
- Reduce the wall-clock time of a program
- Increase the size of the problems that can be solved

Computational resources for parallel computing
- A single computer with multiple processors (CPUs)
- Multiple computers connected by a network
์™œ ๋ณ‘๋ ฌ์ธ๊ฐ€?
๊ณ ์„ฑ๋Šฅ ๋‹จ์ผ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ๊ฐœ๋ฐœ์˜ ์ œํ•œ
๏‚ง

์ „์†ก์†๋„์˜ ํ•œ๊ณ„ (๊ตฌ๋ฆฌ์„  : 9 cm/nanosec)

๏‚ง

์†Œํ˜•ํ™”์˜ ํ•œ๊ณ„

๏‚ง

๊ฒฝ์ œ์  ์ œํ•œ

๋ณด๋‹ค ๋น ๋ฅธ ๋„คํŠธ์›Œํฌ, ๋ถ„์‚ฐ ์‹œ์Šคํ…œ, ๋‹ค์ค‘ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ์•„ํ‚คํ…์ฒ˜์˜ ๋“ฑ์žฅ ๏ƒจ ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ํ™˜๊ฒฝ

์ƒ๋Œ€์ ์œผ๋กœ ๊ฐ’์‹ผ ํ”„๋กœ์„ธ์„œ๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ ๋ฌถ์–ด ๋™์‹œ์— ์‚ฌ์šฉํ•จ
์œผ๋กœ์จ ์›ํ•˜๋Š” ์„ฑ๋Šฅ์ด๋“ ๊ธฐ๋Œ€
Programs and Processes

A process is an executable program - stored as a file on secondary storage - that has been loaded into memory and placed under the execution control of the operating system (kernel).
- Program: stored on secondary storage
- Process: a program being executed by the computer system
- Task = process
Processes

A process is the unit of resource allocation for program execution; one program may run as several processes.

A single-processor system supporting multiple processes
- Wasted resource allocation and overhead from context switching
- Context switching
  - At any instant, only one process runs on a given processor
  - Save the state of the current process -> load the state of another process

Processes are the unit of work distribution in the distributed-memory parallel programming model.
Threads

A thread isolates just the execution aspect of a process.
- Process = execution units (threads) + execution environment (shared resources)
- A process can contain several threads
- Threads in the same process share that process's execution environment

A single-processor system supporting multiple threads
- More efficient resource allocation than multiple processes
- More efficient context switching than multiple processes

Threads are the unit of work distribution in the shared-memory parallel programming model.
Processes and Threads

[Figure: three processes with one thread each vs. one process with three threads.]
Types of Parallelism

Data parallelism
- Domain decomposition
- Each task performs the same series of computations on different data

Task parallelism
- Functional decomposition
- Each task performs different computations on the same or different data
๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (1/3)

๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ : ๋„๋ฉ”์ธ ๋ถ„ํ•ด

Problem Data Set

Task 1

Task 2

Task 3

Task 4
๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (2/3)

์ฝ”๋“œ ์˜ˆ) : ํ–‰๋ ฌ์˜ ๊ณฑ์…ˆ (OpenMP)

Serial Code

Parallel Code
!$OMP PARALLEL DO

DO K=1,N

DO K=1,N

DO J=1,N

DO J=1,N

DO I=1,N
C(I,J) = C(I,J) +

DO I=1,N
C(I,J) = C(I,J) +

(A(I,K)*B(K,J))
END DO
END DO
END DO

A(I,K)*B(K,J)
END DO
END DO
END DO
!$OMP END PARALLEL DO
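
For readers working in C, a minimal OpenMP sketch of the same loop nest is shown below. It is an illustration only (the dimension N and row-major arrays are assumptions, not part of the slide); note that it parallelizes the outer i loop so that each thread writes a disjoint set of rows of C.

```c
#include <omp.h>

#define N 512

/* C(i,j) += A(i,k) * B(k,j); each thread handles its own range of i,
   so no two threads update the same element of C. */
void matmul(double A[N][N], double B[N][N], double C[N][N])
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```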
๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (3/3)
๋ฐ์ดํ„ฐ ๋ถ„ํ•ด (ํ”„๋กœ์„ธ์„œ 4๊ฐœ:K=1,20์ผ ๋•Œ)

Process

Proc0
Proc1
Proc2
Proc3

Iterations of K

K =
K =

1:5
6:10

K = 11:15
K = 16:20

Data Elements

A(I,1:5)
B(1:5,J)
A(I,6:10)
B(6:10,J)
A(I,11:15)
B(11:15,J)
A(I,16:20)
B(16:20,J)
ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (1/3)
ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ : ๊ธฐ๋Šฅ์  ๋ถ„ํ•ด

Problem Instruction Set

Task 1

Task 2

Task 3

Task 4
ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (2/3)
์ฝ”๋“œ ์˜ˆ) : (OpenMP)

Serial Code

Parallel Code

PROGRAM MAIN
โ€ฆ
CALL interpolate()
CALL compute_stats()
CALL gen_random_params()
โ€ฆ
END

PROGRAM MAIN
โ€ฆ
!$OMP PARALLEL
!$OMP SECTIONS
CALL interpolate()
!$OMP SECTION
CALL compute_stats()
!$OMP SECTION
CALL gen_random_params()
!$OMP END SECTIONS
!$OMP END PARALLEL
โ€ฆ
END
ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (3/3)
ํƒœ์Šคํฌ ๋ถ„ํ•ด (3๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ์—์„œ ๋™์‹œ ์ˆ˜ํ–‰)

Process

Code

Proc0

CALL interpolate()

Proc1

CALL compute_stats()

Proc2

CALL gen_random_params()
๋ณ‘๋ ฌ ์•„ํ‚คํ…์ฒ˜ (1/2)

Processor Organizations

Single Instruction,
Single Instruction,
Single Data Stream Multiple Data Stream
(SISD)
(SIMD)

Multiple Instruction, Multiple Instruction,
Single Data Stream Multiple Data Stream
(MIMD)
(MISD)

Uniprocessor
Vector
Processor

Shared memory
Array
Processor (tightly coupled)

Distributed memory
(loosely coupled)

Clusters
Symmetric
multiprocessor
(SMP)

Non-uniform
Memory
Access
(NUMA)
๋ณ‘๋ ฌ ์•„ํ‚คํ…์ฒ˜ (2/2)
์ตœ๊ทผ์˜ ๊ณ ์„ฑ๋Šฅ ์‹œ์Šคํ…œ : ๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ง€์›
๏‚ง

์†Œํ”„ํŠธ ์›จ์–ด์  DSM (Distributed Shared Memory) ๊ตฌํ˜„

โ€ข ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ์ง€์›
โ€ข ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ ๋ณ€์ˆ˜ ๊ณต์œ  ์ง€์›
๏‚ง

ํ•˜๋“œ์›จ์–ด์  DSM ๊ตฌํ˜„ : ๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜

โ€ข ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๊ฐ ๋…ธ๋“œ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์œผ๋กœ ๊ตฌ์„ฑ
โ€ข NUMA : ์‚ฌ์šฉ์ž๋“ค์—๊ฒŒ ํ•˜๋‚˜์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜๋กœ ๋ณด์—ฌ์ง
ex) Superdome(HP), Origin 3000(SGI)
โ€ข SMP ํด๋Ÿฌ์Šคํ„ฐ : SMP๋กœ ๊ตฌ์„ฑ๋œ ๋ถ„์‚ฐ ์‹œ์Šคํ…œ์œผ๋กœ ๋ณด์—ฌ์ง
ex) SP(IBM), Beowulf Clusters
๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ
๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ
๏‚ง
๏‚ง
๏‚ง

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์— ์ ํ•ฉ
๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ํ”„๋กœ๊ทธ๋žจ
OpenMP, Pthreads

๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ
๏‚ง
๏‚ง

๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์— ์ ํ•ฉ
MPI, PVM

ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ
๏‚ง
๏‚ง

๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜
OpenMP + MPI
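
As a rough illustration of the hybrid model, the sketch below combines MPI between processes with OpenMP threads inside each process. It is a hypothetical minimal skeleton, not code from the course material; the actual work partitioning is left abstract.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank, nprocs;

    /* Request an MPI thread level that allows OpenMP threads inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each MPI process (typically one per node) spawns a team of threads. */
    #pragma omp parallel
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, nprocs, omp_get_thread_num(), omp_get_num_threads());
    }

    /* Inter-node communication would use MPI calls outside (or funneled
       through the master thread of) the parallel region. */
    MPI_Finalize();
    return 0;
}
```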
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ

Single thread
time

time

S1

Multi-thread
Thread

S1

fork

P1

P2

P1
P2
P3

P3

join

S2
S2

Shared address space

P4
Process

S2

Process

P4
๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ

Serial
time

time

S1

Messagepassing

S1

S1

S1

S1

P1

P1

P2

P3

P4

P2

S2
S2

S2
S2

S2
S2

S2
S2

Process 0

Process 1

Process 2

Process 3

Node 1

Node 2

Node 3

Node 4

P3
P4
S2
S2
Process

Data transmission over the interconnect
ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ

Message-passing

P1

fork

P2

time

time

S1

Thread

S1

P3

Shared
address

fork

P4
join

join

S2
S2

Thread

S2
S2

Shared
address

Process 0

Process 1

Node 1

Node 2
DSM ์‹œ์Šคํ…œ์˜ ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ

time

S1

S1

S1

S1

P1

P2

P3

P4

Message-passing
S2
S2

S2
S2

S2
S2

S2
S2

Process 0

Process 1

Process 2

Process 3

Node 1

Node 2
SPMD and MPMD (1/4)

SPMD (Single Program, Multiple Data)
- One program is executed simultaneously by several processes
- At any instant the processes execute instructions from the same program, but the instructions being executed may be the same or different from process to process (a small rank-branching sketch follows)

MPMD (Multiple Program, Multiple Data)
- An MPMD application consists of several executables
- When the application runs in parallel, each process may execute the same program as another process or a different one
SPMD and MPMD (2/4)

SPMD
[Figure: the same executable a.out runs on Node 1, Node 2, and Node 3.]

SPMD and MPMD (3/4)

MPMD: Master/Worker (Self-Scheduling)
[Figure: a.out runs on Node 1 (master) while b.out runs on Node 2 and Node 3 (workers).]

SPMD and MPMD (4/4)

MPMD: Coupled Analysis
[Figure: different executables a.out, b.out, and c.out run on Node 1, Node 2, and Node 3.]
- Performance measurement
- Factors that affect performance
- Steps for writing a parallel program
Measuring Program Execution Time (1/2)

time
Usage (bash, ksh): $ time [executable]

$ time mpirun -np 4 -machinefile machines ./exmpi.x
real 0m3.59s
user 0m3.16s
sys  0m0.04s

- real = wall-clock time
- user = CPU time used by the program itself and the libraries it calls
- sys  = CPU time used by system calls made on behalf of the program
- user + sys = CPU time
Measuring Program Execution Time (2/2)

Usage (csh): $ time [executable]

$ time testprog
1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w
 (1)    (2)    (3)    (4)      (5)     (6)   (7) (8)

(1) user CPU time (1.15 s)
(2) system CPU time (0.02 s)
(3) real time (0 min 1.76 s)
(4) fraction of real time spent as CPU time (66.4%)
(5) memory usage: shared (15 KB) + unshared (3981 KB)
(6) input (24 blocks) + output (10 blocks)
(7) no page faults
(8) no swaps
Performance Measurement

Quantitative analysis of the gain obtained by parallelization
- Speed-up
- Efficiency
- Cost
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (1/7)
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (Speed-up) : S(n)

S(n) =

์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰์‹œ๊ฐ„
=
๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰์‹œ๊ฐ„(n๊ฐœ ํ”„๋กœ์„ธ์„œ)

ts
tp

๏‚ง

์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์— ๋Œ€ํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์˜ ์„ฑ๋Šฅ์ด๋“ ์ •๋„

๏‚ง

์‹คํ–‰์‹œ๊ฐ„ = Wall-clock time

๏‚ง

์‹คํ–‰์‹œ๊ฐ„์ด 100์ดˆ๊ฐ€ ๊ฑธ๋ฆฌ๋Š” ์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์„ ๋ณ‘๋ ฌํ™” ํ•˜์—ฌ 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 50์ดˆ ๋งŒ์— ์‹คํ–‰
๋˜์—ˆ๋‹ค๋ฉด,
๏ƒจ S(10) =

100
=
50

2
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (2/7)
์ด์ƒ(Ideal) ์„ฑ๋Šฅํ–ฅ์ƒ๋„ : Amdahlโ€Ÿs Law
๏‚ง f : ์ฝ”๋“œ์˜ ์ˆœ์ฐจ๋ถ€๋ถ„ (0 โ‰ค f โ‰ค 1)
๏‚ง tp = fts + (1-f)ts/n

์ˆœ์ฐจ๋ถ€๋ถ„ ์‹คํ–‰์‹œ
๊ฐ„

๋ณ‘๋ ฌ๋ถ€๋ถ„ ์‹คํ–‰์‹œ
๊ฐ„
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (3/7)

ts
(1

fts
Serial section

f )t S

Parallelizable sections

1

2

n-1

n

1
2
n processes

n-1
n

tp

(1 f )t S / n
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (4/7)

๏‚ง S(n) =

ts =
tp

ts
fts + (1-f)ts/n
1

S(n) =

๏‚ง

์ตœ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ ( n ๏ƒ  โˆž )
S(n) =

๏‚ง

f + (1-f)/n

1
f

ํ”„๋กœ์„ธ์„œ์˜ ๊ฐœ์ˆ˜๋ฅผ ์ฆ๊ฐ€ํ•˜๋ฉด, ์ˆœ์ฐจ๋ถ€๋ถ„ ํฌ๊ธฐ์˜ ์—ญ์ˆ˜์— ์ˆ˜๋ ด
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (5/7)
f = 0.2, n = 4

Serial
Parallel
process 1

20

20

80

20

process 2
process 3

cannot be parallelized

process 4

can be parallelized

S(4) =

1
0.2 + (1-0.2)/4

= 2.5
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (6/7)
ํ”„๋กœ์„ธ์„œ ๊ฐœ์ˆ˜ ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„

f=0

24

Speed-up

20

16

f=0.05

12

f=0.1

8

f=0.2

4

0
0

4

8

12

16

20

number of processors, n

24
์„ฑ๋Šฅํ–ฅ์ƒ๋„ (7/7)
์ˆœ์ฐจ๋ถ€๋ถ„ ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„

16
14

Speed-up

12

n=256

10
8
6

n=16

4
2
0
0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Serial fraction, f

0.8

0.9

1
Efficiency

Efficiency: E(n)

E(n) = ts / (tp x n) = S(n) / n

- Indicates how efficiently the processors are used as their number grows
  - 2x speed-up with 10 processors:   S(10) = 2   -> E(10) = 20 %
  - 10x speed-up with 100 processors: S(100) = 10 -> E(100) = 10 %
Cost

Cost = execution time x number of processors
- Serial program:   Cost = ts
- Parallel program: Cost = tp x n = ts*n / S(n) = ts / E(n)

Example: 2x speed-up with 10 processors, 10x speed-up with 100 processors

ts  | tp | n   | S(n) | E(n) | Cost
100 | 50 | 10  | 2    | 0.2  | 500
100 | 10 | 100 | 10   | 0.1  | 1000
์‹ค์งˆ์  ์„ฑ๋Šฅํ–ฅ์ƒ์— ๊ณ ๋ คํ•  ์‚ฌํ•ญ
์‹ค์ œ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ : ํ†ต์‹ ๋ถ€ํ•˜, ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ ๋ฌธ์ œ
20

80

Serial
parallel

20

20

process 1

cannot be parallelized

process 2

can be parallelized

process 3

communication overhead

process 4

Load unbalance
์„ฑ๋Šฅ์ฆ๊ฐ€๋ฅผ ์œ„ํ•œ ๋ฐฉ์•ˆ๋“ค

1.

ํ”„๋กœ๊ทธ๋žจ์—์„œ ๋ณ‘๋ ฌํ™” ๊ฐ€๋Šฅํ•œ ๋ถ€๋ถ„(Coverage) ์ฆ๊ฐ€
๏‚ง ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐœ์„ 

2.

์ž‘์—…๋ถ€ํ•˜์˜ ๊ท ๋“ฑ ๋ถ„๋ฐฐ : ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ

3.

ํ†ต์‹ ์— ์†Œ๋น„ํ•˜๋Š” ์‹œ๊ฐ„(ํ†ต์‹ ๋ถ€ํ•˜) ๊ฐ์†Œ
์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ์š”์ธ๋“ค

Coverage : Amdahlโ€™s Law
๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ
๋™๊ธฐํ™”
ํ†ต์‹ ๋ถ€ํ•˜

์„ธ๋ถ„์„ฑ
์ž…์ถœ๋ ฅ
Load Balancing

Distribute the work so that all processes are busy for roughly the same amount of time, minimizing waiting.
- Choose the data distribution scheme (block, cyclic, block-cyclic) carefully
- Very important when heterogeneous systems are connected together
- Can also be obtained through dynamic work assignment

[Figure: tasks 0-3 with unequal work; the lightly loaded tasks spend the remaining time waiting.]
Synchronization

Coordination that brings the state or data of parallel tasks into agreement
- A major source of parallel overhead: hurts performance
- Implemented with barriers, locks, semaphores, synchronous communication operations, etc.

Parallel overhead
- Overhead caused by starting, terminating, and coordinating parallel tasks
  - Start: task identification, processor assignment, task loading, data loading, etc.
  - Termination: collecting and sending results, returning operating-system resources, etc.
  - Coordination: synchronization, communication, etc.
Communication Overhead (1/4)

Overhead caused by data communication
- The network has its own latency and bandwidth
- Particularly important for message passing

Factors that affect communication overhead
- Synchronous or asynchronous communication?
- Blocking or non-blocking?
- Point-to-point or collective communication?
- Number of transfers and size of the data transferred
Communication Overhead (2/4)

communication time = latency + message size / bandwidth

- Latency: time for the first bit of the message to arrive
  - send latency + receive latency + propagation delay
- Bandwidth: amount of data that can be communicated per unit time (MB/sec)

effective bandwidth = message size / communication time = bandwidth / (1 + latency x bandwidth / message size)
Communication Overhead (3/4)

[Figure: communication time versus message size - a straight line whose intercept is the latency and whose slope is 1/bandwidth.]
Communication Overhead (4/4)

[Figure: effective bandwidth versus message size for latency = 22 microseconds and bandwidth = 133 MB/sec; effective bandwidth approaches the network bandwidth only for large messages.]
Granularity (1/2)

The ratio of computation time to communication time within a parallel program
- Fine-grained parallelism
  - Relatively little computation between communication or synchronization events
  - Favorable for load balancing
- Coarse-grained parallelism
  - Relatively much computation between communication or synchronization events
  - Unfavorable for load balancing

In general, coarse-grained parallelism is better for performance.
- In fine-grained code, the computation time can fall below the communication or synchronization time
- The trade-off depends on the algorithm and the hardware environment
Granularity (2/2)

[Figure: (a) fine-grained - short bursts of computation separated by frequent communication; (b) coarse-grained - long stretches of computation with occasional communication.]
Input/Output

I/O generally hinders parallelism.
- Writing: overlapping writes when the same file space is shared
- Reading: performance of the file server handling multiple read requests
- I/O that crosses the network (NFS, non-local) becomes a bottleneck

Reduce I/O whenever possible.
- Restrict I/O to specific serial regions
- Perform I/O in local file space

Parallel file systems have been developed (GPFS, PVFS, PPFS, ...)
Parallel I/O programming interfaces have been developed (MPI-2: MPI I/O)
Scalability (1/2)

The ability to keep gaining performance as the environment is scaled up
- Hardware scalability
- Algorithmic scalability

Main hardware factors that affect scalability
- CPU-memory bus bandwidth
- Network bandwidth
- Memory capacity
- Processor clock speed

Scalability (2/2)

[Figure: speed-up versus number of workers.]
์˜์กด์„ฑ๊ณผ ๊ต์ฐฉ
๋ฐ์ดํ„ฐ ์˜์กด์„ฑ : ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰ ์ˆœ์„œ๊ฐ€ ์‹คํ–‰ ๊ฒฐ๊ณผ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๊ฒƒ

DO k = 1, 100
F(k + 2) = F(k +1) + F(k)
ENDDO
๊ต์ฐฉ : ๋‘˜ ์ด์ƒ์˜ ํ”„๋กœ์„ธ์Šค๋“ค์ด ์„œ๋กœ ์ƒ๋Œ€๋ฐฉ์˜ ์ด๋ฒคํŠธ ๋ฐœ์ƒ์„ ๊ธฐ๋‹ค๋ฆฌ๋Š” ์ƒํƒœ

Process 1
X = 4
SOURCE = TASK2
RECEIVE (SOURCE,Y)
DEST = TASK2
SEND (DEST,X)
Z = X + Y

Process 2
Y = 8
SOURCE = TASK1
RECEIVE (SOURCE,X)
DEST = TASK1
SEND (DEST,Y)
Z = X + Y
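
One common way to avoid the blocking-receive deadlock above is a combined send/receive. The sketch below is illustrative only (two ranks exchanging one integer each), not code from the slides.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, other, sendval, recvval;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other   = 1 - rank;              /* assumes exactly two ranks */
    sendval = (rank == 0) ? 4 : 8;

    /* MPI_Sendrecv posts the send and the receive together, so neither
       rank blocks waiting for the other to send first. */
    MPI_Sendrecv(&sendval, 1, MPI_INT, other, 0,
                 &recvval, 1, MPI_INT, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: Z = %d\n", rank, sendval + recvval);
    MPI_Finalize();
    return 0;
}
```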
Dependency

DO k = 1, 100
  F(k+2) = F(k+1) + F(k)
ENDDO

[Figure: computed serially, F(1..n) yields 1, 2, 3, 5, 8, 13, 21, ...; if the loop iterations run in parallel without respecting the dependency, later elements are computed from stale values (e.g. 1, 2, 3, 5(4), 7, 11, 18, ...), so the results differ.]
Steps for Writing a Parallel Program

1. Write, analyze (profile), and optimize the serial code
   - identify hotspots, bottlenecks, data dependencies, etc.
   - data parallelism or task parallelism?
2. Develop the parallel code
   - MPI / OpenMP / ... ?
   - add code for task assignment and control, communication, and synchronization
3. Compile, run, debug
4. Optimize the parallel code
   - improve performance through measurement and analysis
Debugging and Performance Analysis

Debugging
- Take a modular approach when writing the code
- Watch out for communication, synchronization, data dependencies, deadlock, etc.
- Debugger: TotalView

Performance measurement and analysis
- Use timer functions
- Profilers: prof, gprof, pgprof, TAU
Coffee break
II. Parallel Programming using OpenMP
OpenMP๋ž€ ๋ฌด์—‡์ธ๊ฐ€?

๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ํ™˜๊ฒฝ์—์„œ
๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ์ž‘์„ฑ์„ ์œ„ํ•œ
์‘์šฉํ”„๋กœ๊ทธ๋žจ ์ธํ„ฐํŽ˜์ด์Šค(API)
OpenMP์˜ ์—ญ์‚ฌ
1990๋…„๋Œ€ :
๏‚ง ๊ณ ์„ฑ๋Šฅ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๋ฐœ์ „
๏‚ง ์—…์ฒด ๊ณ ์œ ์˜ ์ง€์‹œ์–ด ์ง‘ํ•ฉ ์‚ฌ์šฉ ๏ƒ  ํ‘œ์ค€ํ™”์˜ ํ•„์š”์„ฑ

1994๋…„ ANSI X3H5 ๏ƒ  1996๋…„ openmp.org ์„ค๋ฆฝ
1997๋…„ OpenMP API ๋ฐœํ‘œ
Release History
๏‚ง OpenMP Fortran API ๋ฒ„์ „ 1.0 : 1997๋…„ 10์›”
๏‚ง C/C++ API ๋ฒ„์ „ 1.0 : 1998๋…„ 10์›”
๏‚ง Fortran API ๋ฒ„์ „ 1.1 : 1999๋…„ 11์›”
๏‚ง Fortran API ๋ฒ„์ „ 2.0 : 2000๋…„ 11์›”
๏‚ง C/C++ API ๋ฒ„์ „ 2.0 : 2002๋…„ 3์›”
๏‚ง Combined C/C++ and Fortran API ๋ฒ„์ „ 2.5 : 2005๋…„ 5์›”
๏‚ง API ๋ฒ„์ „ 3.0 : 2008๋…„ 5์›”
OpenMP์˜ ๋ชฉํ‘œ

ํ‘œ์ค€๊ณผ ์ด์‹์„ฑ
๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ‘œ์ค€
๋Œ€๋ถ€๋ถ„์˜ Unix์™€ Windows์— OpenMP ์ปดํŒŒ์ผ๋Ÿฌ ์กด์žฌ
Fortran, C/C++ ์ง€์›
OpenMP์˜ ๊ตฌ์„ฑ (1/2)

Directives

Runtime
Library

Environment
Variables
OpenMP์˜ ๊ตฌ์„ฑ (2/2)
์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด
๏‚ง

์Šค๋ ˆ๋“œ ์‚ฌ์ด์˜ ์ž‘์—…๋ถ„๋‹ด, ํ†ต์‹ , ๋™๊ธฐํ™”๋ฅผ ๋‹ด๋‹น

๏‚ง

์ข์€ ์˜๋ฏธ์˜ OpenMP

์˜ˆ) C$OMP PARALLEL DO
์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
๏‚ง

๋ณ‘๋ ฌ ๋งค๊ฐœ๋ณ€์ˆ˜(์ฐธ์—ฌ ์Šค๋ ˆ๋“œ์˜ ๊ฐœ์ˆ˜, ๋ฒˆํ˜ธ ๋“ฑ)์˜ ์„ค์ •๊ณผ ์กฐํšŒ

์˜ˆ) CALL omp_set_num_threads(128)
ํ™˜๊ฒฝ๋ณ€์ˆ˜
๏‚ง

์‹คํ–‰ ์‹œ์Šคํ…œ์˜ ๋ณ‘๋ ฌ ๋งค๊ฐœ๋ณ€์ˆ˜(์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ๋“ฑ)๋ฅผ ์ •์˜

์˜ˆ) export OMP_NUM_THREADS=8
OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (1/4)
์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ๊ธฐ๋ฐ˜
๏‚ง

์ˆœ์ฐจ์ฝ”๋“œ์˜ ์ ์ ˆํ•œ ์œ„์น˜์— ์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ์‚ฝ์ž…

๏‚ง

์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ์ง€์‹œ์–ด๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์ฝ”๋“œ ์ƒ์„ฑ

๏‚ง

OpenMP๋ฅผ ์ง€์›ํ•˜๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ํ•„์š”

๏‚ง

๋™๊ธฐํ™”, ์˜์กด์„ฑ ์ œ๊ฑฐ ๋“ฑ์˜ ์ž‘์—… ํ•„์š”
OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (2/4)
Fork-Join
๏‚ง
๏‚ง

๋ณ‘๋ ฌํ™”๊ฐ€ ํ•„์š”ํ•œ ๋ถ€๋ถ„์— ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์ƒ์„ฑ
๋ณ‘๋ ฌ๊ณ„์‚ฐ์„ ๋งˆ์น˜๋ฉด ๋‹ค์‹œ ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰

F

J

F

J

O

O

O

O

Master

R

I

R

I

Thread

K

N

K

N

[Parallel Region]

[Parallel Region]
OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (3/4)
์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ์‚ฝ์ž…

Serial Code
PROGRAM exam
โ€ฆ
ialpha = 2
DO i = 1, 100
a(i) = a(i) + ialpha*b(i)
ENDDO
PRINT *, a
END

Parallel Code
PROGRAM exam
โ€ฆ
ialpha = 2
!$OMP PARALLEL DO
DO i = 1, 100
a(i) = a(i) + ialpha*b(i)
ENDDO
!$OMP END PARALLEL DO
PRINT *, a
END
OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (4/4)

Fork-Join
โ€ป export OMP_NUM_THREADS = 4

ialpha = 2

(Master Thread)

(Fork)
DO i=1,25

DO i=26,50

DO i=51,75

DO i=76,100

...

...

...

...

(Join)

(Master)

PRINT *, a

(Slave)

(Master Thread)

(Slave)

(Slave)
OpenMP์˜ ์žฅ์ ๊ณผ ๋‹จ์ 

์žฅ ์ 
๏‚– MPI๋ณด๋‹ค ์ฝ”๋”ฉ, ๋””๋ฒ„๊น…์ด ์‰ฌ์›€
๏‚– ๋ฐ์ดํ„ฐ ๋ถ„๋ฐฐ๊ฐ€ ์ˆ˜์›”

๋‹จ ์ 
โ€ข ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌํ™˜๊ฒฝ์˜ ๋‹ค์ค‘ ํ”„๋กœ์„ธ์„œ
์•„ํ‚คํ…์ฒ˜์—์„œ๋งŒ ๊ตฌํ˜„ ๊ฐ€๋Šฅ

๏‚– ์ ์ง„์  ๋ณ‘๋ ฌํ™”๊ฐ€ ๊ฐ€๋Šฅ

โ€ข OpenMP๋ฅผ ์ง€์›ํ•˜๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ํ•„์š”

๏‚– ํ•˜๋‚˜์˜ ์ฝ”๋“œ๋ฅผ ๋ณ‘๋ ฌ์ฝ”๋“œ์™€ ์ˆœ์ฐจ์ฝ”

โ€ข ๋ฃจํ”„์— ๋Œ€ํ•œ ์˜์กด๋„๊ฐ€ ํผ ๏ƒ  ๋‚ฎ์€

๋“œ๋กœ ์ปดํŒŒ์ผ ๊ฐ€๋Šฅ
๏‚– ์ƒ๋Œ€์ ์œผ๋กœ ์ฝ”๋“œ ํฌ๊ธฐ๊ฐ€ ์ž‘์Œ

๋ณ‘๋ ฌํ™” ํšจ์œจ์„ฑ
โ€ข ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์˜ ํ™•์žฅ์„ฑ
(ํ”„๋กœ์„ธ์„œ ์ˆ˜, ๋ฉ”๋ชจ๋ฆฌ ๋“ฑ) ํ•œ๊ณ„
OpenMP์˜ ์ „ํ˜•์  ์‚ฌ์šฉ
๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ์„ ์ด์šฉํ•œ ๋ฃจํ”„์˜ ๋ณ‘๋ ฌํ™”
1. ์‹œ๊ฐ„์ด ๋งŽ์ด ๊ฑธ๋ฆฌ๋Š” ๋ฃจํ”„๋ฅผ ์ฐพ์Œ (ํ”„๋กœํŒŒ์ผ๋ง)
2. ์˜์กด์„ฑ, ๋ฐ์ดํ„ฐ ์œ ํšจ๋ฒ”์œ„ ์กฐ์‚ฌ
3. ์ง€์‹œ์–ด ์‚ฝ์ž…์œผ๋กœ ๋ณ‘๋ ฌํ™”
ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ์„ ์ด์šฉํ•œ ๋ณ‘๋ ฌํ™”๋„ ๊ฐ€๋Šฅ
Directives (1/5)

OpenMP directive syntax

                          Fortran (fixed form: f77)     Fortran (free form: f90)   C
Directive sentinel        !$OMP <directive>             !$OMP <directive>          #pragma omp <directive>
                          C$OMP <directive>
                          *$OMP <directive>
Line continuation         !$OMP <directive> ...         !$OMP <directive> &        #pragma omp ... \
                          !$OMP& ...                    ...                        ...
Conditional compilation   !$ ...   C$ ...   *$ ...      !$ ...                     #ifdef _OPENMP
Starting column           column 1                      any                        any
Directives (2/5)

Parallel region directive
- PARALLEL / END PARALLEL
- Marks a block of code as a parallel region
- The region is executed simultaneously by multiple threads

Work-sharing directive
- DO / FOR
- Used inside a parallel region
- Assigns loop iterations to the threads based on the loop index

Combined parallel work-sharing directive
- PARALLEL DO / FOR
- Performs the roles of PARALLEL and DO/FOR together
Directives (3/5)

Specifying a parallel region

Fortran
!$OMP PARALLEL
DO i = 1, 10
  PRINT *, 'Hello World', i
ENDDO
!$OMP END PARALLEL

C
#pragma omp parallel
for(i=1; i<=10; i++)
  printf("Hello World %d\n", i);
Directives (4/5)

Parallel region and work sharing

Fortran
!$OMP PARALLEL
!$OMP DO
DO i = 1, 10
  PRINT *, 'Hello World', i
ENDDO
[!$OMP END DO]
!$OMP END PARALLEL

C
#pragma omp parallel
{
  #pragma omp for
  for(i=1; i<=10; i++)
    printf("Hello World %d\n", i);
}
Directives (5/5)

Parallel region and work sharing

Fortran
!$OMP PARALLEL
!$OMP DO
DO i = 1, n
  a(i) = b(i) + c(i)
ENDDO
[!$OMP END DO]      <- optional
!$OMP DO
...
[!$OMP END DO]
!$OMP END PARALLEL

C
#pragma omp parallel
{
  #pragma omp for
  for (i=1; i<=n; i++) {
    a[i] = b[i] + c[i];
  }
  #pragma omp for
  for(...){
    ...
  }
}
์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ ํ™˜๊ฒฝ๋ณ€์ˆ˜ (1/3)
์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
๏‚ง
๏‚ง
๏‚ง

omp_set_num_threads(integer) : ์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ์ง€์ •
omp_get_num_threads() : ์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ๋ฆฌํ„ด
omp_get_thread_num() : ์Šค๋ ˆ๋“œ ID ๋ฆฌํ„ด

ํ™˜๊ฒฝ๋ณ€์ˆ˜
๏‚ง

OMP_NUM_THREADS : ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์Šค๋ ˆ๋“œ ์ตœ๋Œ€ ๊ฐœ์ˆ˜

โ€ข export OMP_NUM_THREADS=16 (ksh)
โ€ข setenv OMP_NUM_THREADS 16 (csh)
C : #include <omp.h>
์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ ํ™˜๊ฒฝ๋ณ€์ˆ˜ (3/3)
omp_set_num_threads
omp_get_thread_num

INTEGER OMP_GET_THREAD_NUM

CALL OMP_SET_NUM_THREADS(4)

Fortran

!$OMP PARALLEL
PRINT*, โ€ฒThread rank: โ€ฒ, OMP_GET_THREAD_NUM()
!$OMP END PARALLEL

#include <omp.h>
omp_set_num_threads(4);

C

#pragma omp parallel
{
printf(โ€ณThread rank:%d๏ผผnโ€ณ,omp_get_thread_num());

}
Main Clauses

private(var1, var2, ...)
shared(var1, var2, ...)
default(shared|private|none)
firstprivate(var1, var2, ...)
lastprivate(var1, var2, ...)
reduction(operator|intrinsic:var1, var2, ...)
schedule(type [,chunk])

(A small C example combining several of these clauses follows.)
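
As an illustration of how these clauses combine (not taken from the slides; the array and variable names are hypothetical):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], sum = 0.0;
    double tmp;                        /* scratch value, private per thread */
    int i;

    for (i = 0; i < N; i++) a[i] = i;

    /* a is shared, tmp is private, partial sums are combined with reduction,
       and iterations are handed out statically in chunks of 100. */
    #pragma omp parallel for default(none) shared(a) private(tmp) \
            reduction(+:sum) schedule(static, 100)
    for (i = 0; i < N; i++) {
        tmp = a[i] * a[i];
        sum += tmp;
    }

    printf("sum of squares = %f\n", sum);
    return 0;
}
```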
clause : reduction (1/4)

reduction(operator|intrinsic:var1, var2, ...)
- A reduction variable must be shared
  - Arrays are allowed (Fortran only): deferred-shape and assumed-shape arrays cannot be used
  - In C, only scalar variables are allowed
- A copy is created in each thread, initialized according to the operator (see the table), and the parallel computation proceeds on the copies
- The partial results computed in parallel by the threads are combined into the final result, which is left with the master thread
clause : reduction (2/4)

!$OMP DO reduction(+:sum)
DO i = 1, 100
  sum = sum + x(i)
ENDDO

Thread 0                   Thread 1
sum0 = 0                   sum1 = 0
DO i = 1, 50               DO i = 51, 100
  sum0 = sum0 + x(i)         sum1 = sum1 + x(i)
ENDDO                      ENDDO

          sum = sum0 + sum1
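
The same pattern in C, as a minimal sketch (the array x and its length are assumptions for illustration):

```c
#include <omp.h>
#include <stdio.h>

#define N 100

int main(void)
{
    double x[N], sum = 0.0;

    for (int i = 0; i < N; i++) x[i] = 1.0;

    /* Each thread accumulates a private partial sum initialized to 0;
       the partial sums are added together when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f\n", sum);   /* expected: 100.0 */
    return 0;
}
```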
clause : reduction (3/4)

Reduction operators : Fortran

Operator | Data Types                                | Initial Value
+        | integer, floating point (complex or real) | 0
*        | integer, floating point (complex or real) | 1
-        | integer, floating point (complex or real) | 0
.AND.    | logical                                   | .TRUE.
.OR.     | logical                                   | .FALSE.
.EQV.    | logical                                   | .TRUE.
.NEQV.   | logical                                   | .FALSE.
MAX      | integer, floating point (real only)       | smallest possible value
MIN      | integer, floating point (real only)       | largest possible value
IAND     | integer                                   | all bits on
IOR      | integer                                   | 0
IEOR     | integer                                   | 0
clause : reduction (4/4)

Reduction operators : C

Operator | Data Types              | Initial Value
+        | integer, floating point | 0
*        | integer, floating point | 1
-        | integer, floating point | 0
&        | integer                 | all bits on
|        | integer                 | 0
^        | integer                 | 0
&&       | integer                 | 1
||       | integer                 | 0
Coffee break
III. Parallel Programming using MPI
Current HPC Platforms : COTS-Based Clusters

COTS = Commercial off-the-shelf

[Figure: a typical cluster - login node(s) with access control, file server(s), and compute nodes (e.g. Nehalem, Gulftown) connected by an interconnect.]
Memory Architectures

Shared Memory
- Single address space for all processors
- UMA and NUMA variants

Distributed Memory

[Figure: UMA and NUMA shared-memory organizations versus a distributed-memory organization.]
What is MPI?

MPI = Message Passing Interface

MPI is a specification for the developers and users of message passing libraries. By itself, it is NOT a library - but rather the specification of what such a library should be.

MPI primarily addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another process through cooperative operations on each process.

Simply stated, the goal of the Message Passing Interface is to provide a widely used standard for writing message passing programs. The interface attempts to be:
- Portable
- Efficient
- Practical
- Flexible
What is MPI?

The MPI standard has gone through a number of revisions, with the most recent version being MPI-3.

Interface specifications have been defined for C and Fortran90 language bindings:
- C++ bindings from MPI-1 are removed in MPI-3
- MPI-3 also provides support for Fortran 2003 and 2008 features

Actual MPI library implementations differ in which version and features of the MPI standard they support. Developers/users will need to be aware of this.
Programming Model

Originally, MPI was designed for distributed memory architectures, which were becoming increasingly popular at that time (1980s - early 1990s).

As architecture trends changed, shared memory SMPs were combined over networks, creating hybrid distributed memory / shared memory systems.
Programming Model

MPI implementers adapted their libraries to handle both types of underlying memory architectures seamlessly. They also adapted/developed ways of handling different interconnects and protocols.

Today, MPI runs on virtually any hardware platform:
- Distributed Memory
- Shared Memory
- Hybrid

The programming model clearly remains a distributed memory model, however, regardless of the underlying physical architecture of the machine.
Reasons for Using MPI

Standardization
- MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries.

Portability
- There is little or no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard.

Performance Opportunities
- Vendor implementations should be able to exploit native hardware features to optimize performance.

Functionality
- There are over 440 routines defined in MPI-3, which includes the majority of those in MPI-2 and MPI-1.

Availability
- A variety of implementations are available, both vendor and public domain.
History and Evolution

MPI has resulted from the efforts of numerous individuals and groups that began in 1992.

1980s - early 1990s: Distributed memory parallel computing develops, as do a number of incompatible software tools for writing such programs - usually with tradeoffs between portability, performance, functionality and price. Recognition of the need for a standard arose.

Apr 1992: Workshop on Standards for Message Passing in a Distributed Memory Environment, sponsored by the Center for Research on Parallel Computing, Williamsburg, Virginia. The basic features essential to a standard message passing interface were discussed, and a working group was established to continue the standardization process. A preliminary draft proposal was developed subsequently.
History and Evolution

Nov 1992: Working group meets in Minneapolis. MPI draft proposal (MPI1) from ORNL presented. The group adopts procedures and organization to form the MPI Forum. It eventually comprised about 175 individuals from 40 organizations, including parallel computer vendors, software writers, academia and application scientists.

Nov 1993: Supercomputing 93 conference - draft MPI standard presented.

May 1994: Final version of MPI-1.0 released. MPI-1.0 was followed by versions MPI-1.1 (Jun 1995), MPI-1.2 (Jul 1997) and MPI-1.3 (May 2008).

MPI-2 picked up where the first MPI specification left off, and addressed topics which went far beyond the MPI-1 specification. It was finalized in 1996. MPI-2.1 (Sep 2008) and MPI-2.2 (Sep 2009) followed.

Sep 2012: The MPI-3.0 standard was approved.
History and Evolution

Documentation for all versions of the MPI standard is available at:
- http://www.mpi-forum.org/docs/
A General Structure of the MPI Program

[Figure: general structure of an MPI program - MPI include file, MPI environment initialization, computation and message passing, MPI environment termination.]
A Header File for MPI Routines

Required for all programs that make MPI library calls.

C include file:       #include "mpi.h"
Fortran include file:  include 'mpif.h'

With MPI-3 Fortran, the USE mpi_f08 module is preferred over using the include file shown above.
The Format of MPI Calls

C names are case sensitive; Fortran names are not.

Programs must not declare variables or functions with names beginning with the prefix MPI_ or PMPI_ (profiling interface).

C Binding
Format:     rc = MPI_Xxxxx(parameter, ...)
Example:    rc = MPI_Bsend(&buf, count, type, dest, tag, comm)
Error code: returned as "rc"; MPI_SUCCESS if successful.

Fortran Binding
Format:     CALL MPI_XXXXX(parameter, ..., ierr)
            call mpi_xxxxx(parameter, ..., ierr)
Example:    call MPI_BSEND(buf, count, type, dest, tag, comm, ierr)
Error code: returned as the "ierr" parameter; MPI_SUCCESS if successful.
Communicators and Groups
MPI uses objects called communicators and groups to define which collection of processes
may communicate with each other.
Most MPI routines require you to specify a communicator as an argument.
Communicators and groups will be covered in more detail later. For now, simply use
MPI_COMM_WORLD whenever a communicator is required - it is the predefined
communicator that includes all of your MPI processes.

Rank
Within a communicator, every process has its own unique, integer identifier assigned by the
system when the process initializes. A rank is sometimes also called a โ€œtask IDโ€. Ranks are
contiguous and begin at zero.
Used by the programmer to specify the source and destination of messages. Often used
conditionally by the application to control program execution (if rank = 0 do this / if rank = 1
do that).

Error Handling

Most MPI routines include a return/error code parameter, as described in the "Format of MPI Calls" section above.

However, according to the MPI standard, the default behavior of an MPI call is to abort if there is an error. This means you will probably not be able to capture a return/error code other than MPI_SUCCESS (zero).

The standard does provide a means to override this default error handler. You can also consult the error handling section of the MPI Standard located at http://www.mpi-forum.org/docs/mpi-11-html/node148.html .

The types of errors displayed to the user are implementation dependent.
Environment Management Routines

MPI_Init
- Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI functions, and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent.

C:       MPI_Init(&argc, &argv)
Fortran: MPI_INIT(ierr)

- Input parameters
  - argc : pointer to the number of arguments
  - argv : pointer to the argument vector
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Comm_size
- Returns the total number of MPI processes in the specified communicator, such as MPI_COMM_WORLD. If the communicator is MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application.

C:       MPI_Comm_size(comm, &size)
Fortran: MPI_COMM_SIZE(comm, size, ierr)

- Input parameters
  - comm : communicator (handle)
- Output parameters
  - size : number of processes in the group of comm (integer)
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Comm_rank
- Returns the rank of the calling MPI process within the specified communicator. Initially, each process will be assigned a unique integer rank between 0 and (number of tasks - 1) within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.

C:       MPI_Comm_rank(comm, &rank)
Fortran: MPI_COMM_RANK(comm, rank, ierr)

- Input parameters
  - comm : communicator (handle)
- Output parameters
  - rank : rank of the calling process in the group of comm (integer)
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Finalize
- Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program - no other MPI routines may be called after it.

C:       MPI_Finalize()
Fortran: MPI_FINALIZE(ierr)

- ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Abort
- Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes regardless of the communicator specified.

C:       MPI_Abort(comm, errorcode)
Fortran: MPI_ABORT(comm, errorcode, ierr)

- Input parameters
  - comm : communicator (handle)
  - errorcode : error code to return to the invoking environment
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Get_processor_name
- Returns the processor name. Also returns the length of the name. The buffer for "name" must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into "name" is implementation dependent - it may not be the same as the output of the "hostname" or "host" shell commands.

C:       MPI_Get_processor_name(&name, &resultlength)
Fortran: MPI_GET_PROCESSOR_NAME(name, resultlength, ierr)

- Output parameters
  - name : a unique specifier for the actual (as opposed to virtual) node; must be an array of size at least MPI_MAX_PROCESSOR_NAME
  - resultlen : length (in characters) of the name
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Get_version
- Returns the version (either 1 or 2) and subversion of MPI.

C:       MPI_Get_version(&version, &subversion)
Fortran: MPI_GET_VERSION(version, subversion, ierr)

- Output parameters
  - version : major version of MPI (1 or 2)
  - subversion : minor version of MPI
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Initialized
- Indicates whether MPI_Init has been called - returns flag as either logical true (1) or false (0).

C:       MPI_Initialized(&flag)
Fortran: MPI_INITIALIZED(flag, ierr)

- Output parameters
  - flag : true if MPI_Init has been called and false otherwise
  - ierr : the error return argument (Fortran)
Environment Management Routines

MPI_Wtime
- Returns an elapsed wall clock time in seconds (double precision) on the calling processor.

C:       MPI_Wtime()
Fortran: MPI_WTIME()

- Return value: time in seconds since an arbitrary time in the past.

MPI_Wtick
- Returns the resolution in seconds (double precision) of MPI_Wtime.

C:       MPI_Wtick()
Fortran: MPI_WTICK()

- Return value: time in seconds of the resolution of MPI_Wtime.
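
A small illustration of timing a code region with MPI_Wtime (the work being timed is a placeholder):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();

    /* ... work to be timed goes here ... */

    double t1 = MPI_Wtime();
    printf("elapsed = %f s (clock resolution %g s)\n", t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}
```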
Example: Hello world

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rc;
    rc = MPI_Init(&argc, &argv);
    printf("Hello world.\n");
    rc = MPI_Finalize();
    return 0;
}
Example: Hello world

Compile and execute an MPI program:

$ module load [compiler] [mpi]
$ mpicc hello.c
$ mpirun -np 4 -hostfile [hostfile] ./a.out

Make a hostfile:

ibs0001 slots=2
ibs0002 slots=2
ibs0003 slots=2
ibs0003 slots=2
...
Example : Environment Management Routines

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, len, rc;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        printf("Error starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(hostname, &len);
    printf("Number of tasks= %d My rank= %d Running on %s\n", numtasks, rank, hostname);

    /******* do some work *******/

    rc = MPI_Finalize();
    return 0;
}
Types of Point-to-Point Operations

MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks. One task is performing a send operation and the other task is performing a matching receive operation.

There are different types of send and receive routines used for different purposes.
- Synchronous send
- Blocking send / blocking receive
- Non-blocking send / non-blocking receive
- Buffered send
- Combined send/receive
- "Ready" send

Any type of send routine can be paired with any type of receive routine.

MPI also provides several routines associated with send-receive operations, such as those used to wait for a message's arrival or probe to find out if a message has arrived.
Buffering

In a perfect world, every send operation would be perfectly synchronized with its matching receive. This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync.

Consider the following two cases:
- A send operation occurs 5 seconds before the receive is ready - where is the message while the receive is pending?
- Multiple sends arrive at the same receiving task which can only accept one send at a time - what happens to the messages that are "backing up"?

Buffering

The MPI implementation (not the MPI standard) decides what happens to data in these types of cases. Typically, a system buffer area is reserved to hold data in transit.
Buffering

System buffer space is:
- Opaque to the programmer and managed entirely by the MPI library
- A finite resource that can be easy to exhaust
- Often mysterious and not well documented
- Able to exist on the sending side, the receiving side, or both
- Something that may improve program performance because it allows send-receive operations to be asynchronous
Blocking vs. Non-blocking

Most of the MPI point-to-point routines can be used in either blocking or non-blocking mode.

Blocking
- A blocking send routine will only "return" after it is safe to modify the application buffer (your send data) for reuse. Safe means that modifications will not affect the data intended for the receive task. Safe does not imply that the data was actually received - it may very well be sitting in a system buffer.
- A blocking send can be synchronous, which means there is handshaking occurring with the receive task to confirm a safe send.
- A blocking send can be asynchronous if a system buffer is used to hold the data for eventual delivery to the receive.
- A blocking receive only "returns" after the data has arrived and is ready for use by the program.

Non-blocking
- Non-blocking send and receive routines behave similarly - they will return almost immediately. They do not wait for any communication events to complete, such as message copying from user memory to system buffer space or the actual arrival of the message.
- Non-blocking operations simply "request" the MPI library to perform the operation when it is able. The user cannot predict when that will happen.
- It is unsafe to modify the application buffer (your variable space) until you know for a fact that the requested non-blocking operation was actually performed by the library. There are "wait" routines used to do this.
- Non-blocking communications are primarily used to overlap computation with communication and exploit possible performance gains.
MPI Message Passing Routine Arguments

MPI point-to-point communication routines generally have an argument list that takes one of the following formats:

Blocking send:        MPI_Send(buffer, count, type, dest, tag, comm)
Non-blocking send:    MPI_Isend(buffer, count, type, dest, tag, comm, request)
Blocking receive:     MPI_Recv(buffer, count, type, source, tag, comm, status)
Non-blocking receive: MPI_Irecv(buffer, count, type, source, tag, comm, request)

Buffer
- Program (application) address space that references the data that is to be sent or received. In most cases, this is simply the variable name that is to be sent/received. For C programs, this argument is passed by reference and usually must be prepended with an ampersand: &var1

Data count
- Indicates the number of data elements of a particular type to be sent.
MPI Message Passing Routine Arguments

Data type
- For reasons of portability, MPI predefines its elementary data types. The table below lists those required by the standard.

C Data Type          | Corresponding C type
MPI_CHAR             | signed char
MPI_SHORT            | signed short int
MPI_INT              | signed int
MPI_LONG             | signed long int
MPI_SIGNED_CHAR      | signed char
MPI_UNSIGNED_CHAR    | unsigned char
MPI_UNSIGNED_SHORT   | unsigned short int
MPI_UNSIGNED         | unsigned int
MPI_UNSIGNED_LONG    | unsigned long int
MPI_FLOAT            | float
MPI_DOUBLE           | double
MPI_LONG_DOUBLE      | long double
MPI Message Passing Routine Arguments

Destination
- An argument to send routines that indicates the process where a message should be delivered. Specified as the rank of the receiving process.

Tag
- Arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wild card MPI_ANY_TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers 0 - 32767 can be used as tags, but most implementations allow a much larger range than this.

Communicator
- Indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating new communicators, the predefined communicator MPI_COMM_WORLD is usually used.
MPI Message Passing Routine Arguments

Status
- For a receive operation, indicates the source of the message and the tag of the message.
- In C, this argument is a pointer to the predefined structure MPI_Status (ex. stat.MPI_SOURCE, stat.MPI_TAG).
- In Fortran, it is an integer array of size MPI_STATUS_SIZE (ex. stat(MPI_SOURCE), stat(MPI_TAG)).
- Additionally, the actual number of elements received is obtainable from Status via the MPI_Get_count routine.

Request
- Used by non-blocking send and receive operations.
- Since non-blocking operations may return before the requested system buffer space is obtained, the system issues a unique "request number".
- The programmer uses this system-assigned "handle" later (in a WAIT type routine) to determine completion of the non-blocking operation.
- In C, this argument is a pointer to the predefined structure MPI_Request.
- In Fortran, it is an integer.
Example : Blocking Message Passing Routine (1/2)
#include "mpi.h"
#include <stdio.h>
int main(argc,argv)
int argc;
char *argv[]; {
int numtasks, rank, dest, source, rc, count, tag=1;
char inmsg, outmsg='x';
MPI_Status Stat;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
dest = 1;
source = 1;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
}
else if (rank == 1) {
dest = 0;
source = 0;
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
}

Example : Blocking Message Passing Routine (2/2)

rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
printf("Task %d: Received %d char(s) from task %d with tag %d\n",
       rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);
MPI_Finalize();
return 0;
}
Example : Dead Lock
#include "mpi.h"
#include <stdio.h>
int main(argc,argv)
int argc;
char *argv[]; {
int numtasks, rank, dest, source, rc, count, tag=1;
char inmsg, outmsg='x';
MPI_Status Stat;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
dest = 1;
source = 1;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
}
else if (rank == 1) {
dest = 0;
source = 0;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
}

Example : Non-Blocking Message Passing Routine (1/2)
Nearest neighbor exchange in a ring topology

#include "mpi.h"
#include <stdio.h>
int main(argc,argv)
int argc;
char *argv[]; {
int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2;
MPI_Request reqs[4];
MPI_Status stats[4];
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
prev = rank-1;
next = rank+1;
if (rank == 0) prev = numtasks - 1;
if (rank == (numtasks - 1)) next = 0;

Example : Non-Blocking Message Passing Routine (2/2)

MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);

/* do some work */

MPI_Waitall(4, reqs, stats);
MPI_Finalize();
return 0;
}
Advanced Example : Monte-Carlo Simulation

<Problem>
- Monte Carlo simulation
- Uses random numbers
- PI = 4 x Ac/As (Ac: area inside the circle, As: area of the square)

<Requirement>
- Use N processes (ranks)
- Point-to-point communication

[Figure: a quarter circle of radius r inscribed in a square.]
Advanced Example : Monte-Carlo Simulation for PI

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main() {
    const long num_step=100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0;
    cnt = 0;
    r = 0.0;
    for (i=0; i<num_step; i++) {
        x = rand() / (RAND_MAX+1.0);
        y = rand() / (RAND_MAX+1.0);
        r = sqrt(x*x + y*y);
        if (r<=1) cnt += 1;
    }
    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
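
One possible MPI parallelization of the serial loop above, matching the exercise's requirements (N ranks, point-to-point communication). This is a hedged sketch, not the course's reference solution; in particular the per-rank seeding with srand(rank + 1) is an assumption made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    int rank, nprocs;
    long i, cnt = 0, total;
    double x, y, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(rank + 1);                              /* different stream per rank */
    for (i = rank; i < num_step; i += nprocs) {   /* cyclic work split */
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        if (sqrt(x * x + y * y) <= 1.0) cnt++;
    }

    if (rank != 0) {                              /* workers send their counts */
        MPI_Send(&cnt, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    } else {                                      /* rank 0 collects with p2p receives */
        total = cnt;
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(&cnt, 1, MPI_LONG, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += cnt;
        }
        pi = 4.0 * (double)total / (double)num_step;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }

    MPI_Finalize();
    return 0;
}
```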
Advanced Example : Numerical integration for PI

<Problem>
- Compute PI by numerical integration:

  integral from 0 to 1 of 4/(1+x^2) dx = PI
  ~= (1/n) * sum over i = 1..n of 4 / (1 + ((i-0.5)/n)^2)

  The interval [0,1] is divided into n strips of width 1/n, and the integrand is evaluated at the midpoint x_i = (i-0.5)/n of each strip.

<Requirement>
- Point-to-point communication
Advanced Example : Numerical integration for PI

#include <stdio.h>
#include <math.h>

int main() {
    const long num_step=100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0/(double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");
    for (i=0; i<num_step; i++) {
        x = ((double)i + 0.5) * step;   /* midpoint of the i-th strip (0-based i) */
        sum += 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("PI = %5lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
Type of Collective Operations

Synchronization
- Processes wait until all members of the group have reached the synchronization point.

Data Movement
- broadcast, scatter/gather, all to all.

Collective Computation (reductions)
- One member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
Programming Considerations and Restrictions

With MPI-3, collective operations can be blocking or non-blocking. Only blocking operations are covered in this tutorial.

Collective communication routines do not take message tag arguments.

Collective operations within a subset of processes are accomplished by first partitioning the subset into new groups and then attaching the new groups to new communicators.

Can only be used with MPI predefined datatypes - not with MPI Derived Data Types.

MPI-2 extended most collective operations to allow data movement between intercommunicators (not covered here).
Collective Communication Routines

MPI_Barrier
- Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.

C:       MPI_Barrier(comm)
Fortran: MPI_BARRIER(comm, ierr)
Collective Communication Routines

MPI_Bcast
- Data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.

C:       MPI_Bcast(&buffer, count, datatype, root, comm)
Fortran: MPI_BCAST(buffer, count, datatype, root, comm, ierr)
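
A minimal usage sketch (the buffer contents are illustrative): rank 0 fills an array and broadcasts it to every rank in MPI_COMM_WORLD.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data[4] = {0, 0, 0, 0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)                      /* only the root fills the buffer */
        for (int i = 0; i < 4; i++) data[i] = i + 1;

    /* After the call, every rank's copy of data holds {1,2,3,4}. */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: %d %d %d %d\n", rank, data[0], data[1], data[2], data[3]);
    MPI_Finalize();
    return 0;
}
```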
Collective Communication Routines

MPI_Scatter
- Data movement operation. Distributes distinct messages from a single source task to each task in the group.

C:       MPI_Scatter(&sendbuf, sendcnt, sendtype, &recvbuf, recvcnt, recvtype, root, comm)
Fortran: MPI_SCATTER(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm, ierr)
Collective Communication Routines

MPI_Gather
- Data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.

C:       MPI_Gather(&sendbuf, sendcnt, sendtype, &recvbuf, recvcount, recvtype, root, comm)
Fortran: MPI_GATHER(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm, ierr)
Collective Communication Routines

MPI_Allgather
- Data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.

C:       MPI_Allgather(&sendbuf, sendcount, sendtype, &recvbuf, recvcount, recvtype, comm)
Fortran: MPI_ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
Collective Communication Routines

MPI_Reduce
- Collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.

C:       MPI_Reduce(&sendbuf, &recvbuf, count, datatype, op, root, comm)
Fortran: MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
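
A minimal usage sketch: each rank contributes its rank number and rank 0 receives the sum (the payload is purely illustrative).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs, global = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Sum the per-rank values; only the root (rank 0) receives the result. */
    MPI_Reduce(&rank, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", nprocs - 1, global);

    MPI_Finalize();
    return 0;
}
```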
Collective Communication Routines

The predefined MPI reduction operations appear below. Users can also define their own reduction functions by using the MPI_Op_create routine.

MPI Reduction Operation | Meaning                | C Data Types
MPI_MAX                 | maximum                | integer, float
MPI_MIN                 | minimum                | integer, float
MPI_SUM                 | sum                    | integer, float
MPI_PROD                | product                | integer, float
MPI_LAND                | logical AND            | integer
MPI_BAND                | bit-wise AND           | integer, MPI_BYTE
MPI_LOR                 | logical OR             | integer
MPI_BOR                 | bit-wise OR            | integer, MPI_BYTE
MPI_LXOR                | logical XOR            | integer
MPI_BXOR                | bit-wise XOR           | integer, MPI_BYTE
MPI_MAXLOC              | max value and location | float, double and long double
MPI_MINLOC              | min value and location | float, double and long double
Collective Communication Routines

MPI_Allreduce
- Collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.

C:       MPI_Allreduce(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran: MPI_ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm, ierr)
Collective Communication Routines

MPI_Reduce_scatter
- Collective computation operation + data movement. First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.

C:       MPI_Reduce_scatter(&sendbuf, &recvbuf, recvcount, datatype, op, comm)
Fortran: MPI_REDUCE_SCATTER(sendbuf, recvbuf, recvcount, datatype, op, comm, ierr)
Collective Communication Routines

MPI_Alltoall
- Data movement operation. Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.

C:       MPI_Alltoall(&sendbuf, sendcount, sendtype, &recvbuf, recvcnt, recvtype, comm)
Fortran: MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm, ierr)
Collective Communication Routines

MPI_Scan
- Performs a scan operation with respect to a reduction operation across a task group.

C:       MPI_Scan(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran: MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm, ierr)
Collective Communication Routines

[Figure: data movement of the collective operations (for some operator *):
 - broadcast: A on P0 is copied to P0-P3.
 - scatter / gather: P0's A, B, C, D are distributed one element per process, and gathered back.
 - reduce: A*B*C*D is left on P0; allreduce leaves A*B*C*D on every process.
 - allgather: every process ends up with A, B, C, D.
 - scan: P0..P3 receive the prefix results A, A*B, A*B*C, A*B*C*D.
 - alltoall: process Pi sends its j-th element to Pj, effectively transposing the data.
 - reduce-scatter: element-wise reductions Ai*Bi*Ci*Di are computed and the i-th result is left on Pi.]
Example : Collective Communication (1/2)
Perform a scatter operation on the rows of an array
#include "mpi.h"
#include <stdio.h>
#define SIZE 4
int main(argc,argv)
int argc;
char *argv[]; {
int numtasks, rank, sendcount, recvcount, source;
float sendbuf[SIZE][SIZE] = {
{1.0, 2.0, 3.0, 4.0},
{5.0, 6.0, 7.0, 8.0},
{9.0, 10.0, 11.0, 12.0},
{13.0, 14.0, 15.0, 16.0} };
float recvbuf[SIZE];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

Example : Collective Communication (2/2)

if (numtasks == SIZE) {
  source = 1;
  sendcount = SIZE;
  recvcount = SIZE;
  MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount,
              MPI_FLOAT,source,MPI_COMM_WORLD);
  printf("rank= %d Results: %f %f %f %f\n",rank,recvbuf[0],
         recvbuf[1],recvbuf[2],recvbuf[3]);
}
else
  printf("Must specify %d processors. Terminating.\n",SIZE);
MPI_Finalize();
return 0;
}
Advanced Example : Monte-Carlo Simulation for PI

Use the collective communication routines!

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main() {
    const long num_step=100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0;
    cnt = 0;
    r = 0.0;
    for (i=0; i<num_step; i++) {
        x = rand() / (RAND_MAX+1.0);
        y = rand() / (RAND_MAX+1.0);
        r = sqrt(x*x + y*y);
        if (r<=1) cnt += 1;
    }
    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
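
A possible collective-communication version of this exercise, again only a sketch (the per-rank seeding is an assumption): a single MPI_Reduce replaces the explicit send/receive loop of the point-to-point version.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    int rank, nprocs;
    long i, cnt = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(rank + 1);
    for (i = rank; i < num_step; i += nprocs) {
        double x = rand() / (RAND_MAX + 1.0);
        double y = rand() / (RAND_MAX + 1.0);
        if (x * x + y * y <= 1.0) cnt++;
    }

    /* All partial counts are summed onto rank 0 in one collective call. */
    MPI_Reduce(&cnt, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double pi = 4.0 * (double)total / (double)num_step;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }

    MPI_Finalize();
    return 0;
}
```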
Advanced Example : Numerical integration for PI

Use the collective communication routines!

#include <stdio.h>
#include <math.h>

int main() {
    const long num_step=100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0/(double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");
    for (i=0; i<num_step; i++) {
        x = ((double)i + 0.5) * step;   /* midpoint of the i-th strip (0-based i) */
        sum += 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("PI = %5lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
Any questions?

Introduction to Parallel Programming

  • 2. CONTENT S I. Introduction to Parallel Computing II. Parallel Programming using OpenMP III. Parallel Programming using MPI
  • 3. I. Introduction to Parallel Computing
  • 4. ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ (1/3) ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ž€, ์ˆœ์ฐจ์ ์œผ๋กœ ์ง„ํ–‰๋˜๋Š” ๊ณ„์‚ฐ์˜์—ญ์„ ์—ฌ๋Ÿฌ ๊ฐœ๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ๊ฐ์„ ์—ฌ๋Ÿฌ ํ”„๋กœ์„ธ์„œ์—์„œ ๋™์‹œ์— ์ˆ˜ํ–‰ ๋˜๋„๋ก ํ•˜๋Š” ๊ฒƒ
  • 6. ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ (3/3) ์ฃผ๋œ ๋ชฉ์  : ๋”์šฑ ํฐ ๋ฌธ์ œ๋ฅผ ๋”์šฑ ๋นจ๋ฆฌ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ ๏‚ง ํ”„๋กœ๊ทธ๋žจ์˜ wall-clock time ๊ฐ์†Œ ๏‚ง ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ๋ฌธ์ œ์˜ ํฌ๊ธฐ ์ฆ๊ฐ€ ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ๊ณ„์‚ฐ ์ž์› ๏‚ง ์—ฌ๋Ÿฌ ๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ(CPU)๋ฅผ ๊ฐ€์ง€๋Š” ๋‹จ์ผ ์ปดํ“จํ„ฐ ๏‚ง ๋„คํŠธ์›Œํฌ๋กœ ์—ฐ๊ฒฐ๋œ ๋‹ค์ˆ˜์˜ ์ปดํ“จํ„ฐ
  • 7. ์™œ ๋ณ‘๋ ฌ์ธ๊ฐ€? ๊ณ ์„ฑ๋Šฅ ๋‹จ์ผ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ๊ฐœ๋ฐœ์˜ ์ œํ•œ ๏‚ง ์ „์†ก์†๋„์˜ ํ•œ๊ณ„ (๊ตฌ๋ฆฌ์„  : 9 cm/nanosec) ๏‚ง ์†Œํ˜•ํ™”์˜ ํ•œ๊ณ„ ๏‚ง ๊ฒฝ์ œ์  ์ œํ•œ ๋ณด๋‹ค ๋น ๋ฅธ ๋„คํŠธ์›Œํฌ, ๋ถ„์‚ฐ ์‹œ์Šคํ…œ, ๋‹ค์ค‘ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ์•„ํ‚คํ…์ฒ˜์˜ ๋“ฑ์žฅ ๏ƒจ ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ํ™˜๊ฒฝ ์ƒ๋Œ€์ ์œผ๋กœ ๊ฐ’์‹ผ ํ”„๋กœ์„ธ์„œ๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ ๋ฌถ์–ด ๋™์‹œ์— ์‚ฌ์šฉํ•จ ์œผ๋กœ์จ ์›ํ•˜๋Š” ์„ฑ๋Šฅ์ด๋“ ๊ธฐ๋Œ€
  • 8. ํ”„๋กœ๊ทธ๋žจ๊ณผ ํ”„๋กœ์„ธ์Šค ํ”„๋กœ์„ธ์Šค๋Š” ๋ณด์กฐ ๊ธฐ์–ต ์žฅ์น˜์— ํ•˜๋‚˜์˜ ํŒŒ์ผ๋กœ์„œ ์ €์žฅ๋˜์–ด ์žˆ๋˜ ์‹คํ–‰ ๊ฐ€๋Šฅํ•œ ํ”„๋กœ๊ทธ๋žจ์ด ๋กœ๋”ฉ๋˜์–ด ์šด์˜ ์ฒด์ œ(์ปค๋„)์˜ ์‹คํ–‰ ์ œ์–ด ์ƒํƒœ์— ๋†“์ธ ๊ฒƒ ๏‚ง ํ”„๋กœ๊ทธ๋žจ : ๋ณด์กฐ ๊ธฐ์–ต ์žฅ์น˜์— ์ €์žฅ ๏‚ง ํ”„๋กœ์„ธ์Šค : ์ปดํ“จํ„ฐ ์‹œ์Šคํ…œ์— ์˜ํ•˜์—ฌ ์‹คํ–‰ ์ค‘์ธ ํ”„๋กœ๊ทธ๋žจ ๏‚ง ํƒœ์Šคํฌ = ํ”„๋กœ์„ธ์Šค
  • 9. ํ”„๋กœ์„ธ์Šค ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰์„ ์œ„ํ•œ ์ž์› ํ• ๋‹น์˜ ๋‹จ์œ„๊ฐ€ ๋˜๊ณ , ํ•œ ํ”„๋กœ๊ทธ๋žจ์—์„œ ์—ฌ๋Ÿฌ ๊ฐœ ์‹คํ–‰ ๊ฐ€๋Šฅ ๋‹ค์ค‘ ํ”„๋กœ์„ธ์Šค๋ฅผ ์ง€์›ํ•˜๋Š” ๋‹จ์ผ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ๏‚ง ์ž์› ํ• ๋‹น์˜ ๋‚ญ๋น„, ๋ฌธ๋งฅ๊ตํ™˜์œผ๋กœ ์ธํ•œ ๋ถ€ํ•˜ ๋ฐœ์ƒ ๏‚ง ๋ฌธ๋งฅ๊ตํ™˜ โ€ข ์–ด๋–ค ์ˆœ๊ฐ„ ํ•œ ํ”„๋กœ์„ธ์„œ์—์„œ ์‹คํ–‰ ์ค‘์ธ ํ”„๋กœ์„ธ์Šค๋Š” ํ•ญ์ƒ ํ•˜๋‚˜ โ€ข ํ˜„์žฌ ํ”„๋กœ์„ธ์Šค ์ƒํƒœ ์ €์žฅ ๏ƒ  ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์Šค ์ƒํƒœ ์ ์žฌ ๋ถ„์‚ฐ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ์˜ ์ž‘์—…ํ• ๋‹น ๊ธฐ์ค€
  • 10. ์Šค๋ ˆ๋“œ ํ”„๋กœ์„ธ์Šค์—์„œ ์‹คํ–‰์˜ ๊ฐœ๋…๋งŒ์„ ๋ถ„๋ฆฌํ•œ ๊ฒƒ ๏‚ง ํ”„๋กœ์„ธ์Šค = ์‹คํ–‰๋‹จ์œ„(์Šค๋ ˆ๋“œ) + ์‹คํ–‰ํ™˜๊ฒฝ(๊ณต์œ ์ž์›) ๏‚ง ํ•˜๋‚˜์˜ ํ”„๋กœ์„ธ์Šค์— ์—ฌ๋Ÿฌ ๊ฐœ ์กด์žฌ๊ฐ€๋Šฅ ๏‚ง ๊ฐ™์€ ํ”„๋กœ์„ธ์Šค์— ์†ํ•œ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ์™€ ์‹คํ–‰ํ™˜๊ฒฝ์„ ๊ณต์œ  ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ๋ฅผ ์ง€์›ํ•˜๋Š” ๋‹จ์ผ ํ”„๋กœ์„ธ์„œ ์‹œ์Šคํ…œ ๏‚ง ๋‹ค์ค‘ ํ”„๋กœ์„ธ์Šค๋ณด๋‹ค ํšจ์œจ์ ์ธ ์ž์› ํ• ๋‹น ๏‚ง ๋‹ค์ค‘ ํ”„๋กœ์„ธ์Šค๋ณด๋‹ค ํšจ์œจ์ ์ธ ๋ฌธ๋งฅ๊ตํ™˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ์˜ ์ž‘์—…ํ• ๋‹น ๊ธฐ์ค€
  • 11. ํ”„๋กœ์„ธ์Šค์™€ ์Šค๋ ˆ๋“œ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๊ฐ–๋Š” 3๊ฐœ์˜ ํ”„๋กœ์„ธ์Šค ์Šค๋ ˆ๋“œ ํ”„๋กœ์„ธ์Šค 3๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๊ฐ–๋Š” ํ•˜๋‚˜์˜ ํ”„๋กœ์„ธ์Šค
  • 12. ๋ณ‘๋ ฌ์„ฑ ์œ ํ˜• ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (Data Parallelism) ๏‚ง ๋„๋ฉ”์ธ ๋ถ„ํ•ด (Domain Decomposition) ๏‚ง ๊ฐ ํƒœ์Šคํฌ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๋™์ผํ•œ ์ผ๋ จ์˜ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (Task Parallelism) ๏‚ง ๊ธฐ๋Šฅ์  ๋ถ„ํ•ด (Functional Decomposition) ๏‚ง ๊ฐ ํƒœ์Šคํฌ๋Š” ๊ฐ™๊ฑฐ๋‚˜ ๋˜๋Š” ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์„œ๋กœ ๋‹ค๋ฅธ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰
  • 13. ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (1/3) ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ : ๋„๋ฉ”์ธ ๋ถ„ํ•ด Problem Data Set Task 1 Task 2 Task 3 Task 4
  • 14. ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (2/3) ์ฝ”๋“œ ์˜ˆ) : ํ–‰๋ ฌ์˜ ๊ณฑ์…ˆ (OpenMP) Serial Code Parallel Code !$OMP PARALLEL DO DO K=1,N DO K=1,N DO J=1,N DO J=1,N DO I=1,N C(I,J) = C(I,J) + DO I=1,N C(I,J) = C(I,J) + (A(I,K)*B(K,J)) END DO END DO END DO A(I,K)*B(K,J) END DO END DO END DO !$OMP END PARALLEL DO
  • 15. ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ (3/3) ๋ฐ์ดํ„ฐ ๋ถ„ํ•ด (ํ”„๋กœ์„ธ์„œ 4๊ฐœ:K=1,20์ผ ๋•Œ) Process Proc0 Proc1 Proc2 Proc3 Iterations of K K = K = 1:5 6:10 K = 11:15 K = 16:20 Data Elements A(I,1:5) B(1:5,J) A(I,6:10) B(6:10,J) A(I,11:15) B(11:15,J) A(I,16:20) B(16:20,J)
  • 16. ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (1/3) ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ : ๊ธฐ๋Šฅ์  ๋ถ„ํ•ด Problem Instruction Set Task 1 Task 2 Task 3 Task 4
  • 17. ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (2/3) ์ฝ”๋“œ ์˜ˆ) : (OpenMP) Serial Code Parallel Code PROGRAM MAIN โ€ฆ CALL interpolate() CALL compute_stats() CALL gen_random_params() โ€ฆ END PROGRAM MAIN โ€ฆ !$OMP PARALLEL !$OMP SECTIONS CALL interpolate() !$OMP SECTION CALL compute_stats() !$OMP SECTION CALL gen_random_params() !$OMP END SECTIONS !$OMP END PARALLEL โ€ฆ END
  • 18. ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ (3/3) ํƒœ์Šคํฌ ๋ถ„ํ•ด (3๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ์—์„œ ๋™์‹œ ์ˆ˜ํ–‰) Process Code Proc0 CALL interpolate() Proc1 CALL compute_stats() Proc2 CALL gen_random_params()
  • 19. ๋ณ‘๋ ฌ ์•„ํ‚คํ…์ฒ˜ (1/2) Processor Organizations Single Instruction, Single Instruction, Single Data Stream Multiple Data Stream (SISD) (SIMD) Multiple Instruction, Multiple Instruction, Single Data Stream Multiple Data Stream (MIMD) (MISD) Uniprocessor Vector Processor Shared memory Array Processor (tightly coupled) Distributed memory (loosely coupled) Clusters Symmetric multiprocessor (SMP) Non-uniform Memory Access (NUMA)
  • 20. ๋ณ‘๋ ฌ ์•„ํ‚คํ…์ฒ˜ (2/2) ์ตœ๊ทผ์˜ ๊ณ ์„ฑ๋Šฅ ์‹œ์Šคํ…œ : ๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ง€์› ๏‚ง ์†Œํ”„ํŠธ ์›จ์–ด์  DSM (Distributed Shared Memory) ๊ตฌํ˜„ โ€ข ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ์ง€์› โ€ข ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์—์„œ ๋ณ€์ˆ˜ ๊ณต์œ  ์ง€์› ๏‚ง ํ•˜๋“œ์›จ์–ด์  DSM ๊ตฌํ˜„ : ๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜ โ€ข ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๊ฐ ๋…ธ๋“œ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์œผ๋กœ ๊ตฌ์„ฑ โ€ข NUMA : ์‚ฌ์šฉ์ž๋“ค์—๊ฒŒ ํ•˜๋‚˜์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜๋กœ ๋ณด์—ฌ์ง ex) Superdome(HP), Origin 3000(SGI) โ€ข SMP ํด๋Ÿฌ์Šคํ„ฐ : SMP๋กœ ๊ตฌ์„ฑ๋œ ๋ถ„์‚ฐ ์‹œ์Šคํ…œ์œผ๋กœ ๋ณด์—ฌ์ง ex) SP(IBM), Beowulf Clusters
  • 21. ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ ๏‚ง ๏‚ง ๏‚ง ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์— ์ ํ•ฉ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ํ”„๋กœ๊ทธ๋žจ OpenMP, Pthreads ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ ๏‚ง ๏‚ง ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์— ์ ํ•ฉ MPI, PVM ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ ๏‚ง ๏‚ง ๋ถ„์‚ฐ-๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜ OpenMP + MPI
  • 22. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ Single thread time time S1 Multi-thread Thread S1 fork P1 P2 P1 P2 P3 P3 join S2 S2 Shared address space P4 Process S2 Process P4
  • 23. ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ Serial time time S1 Messagepassing S1 S1 S1 S1 P1 P1 P2 P3 P4 P2 S2 S2 S2 S2 S2 S2 S2 S2 Process 0 Process 1 Process 2 Process 3 Node 1 Node 2 Node 3 Node 4 P3 P4 S2 S2 Process Data transmission over the interconnect
  • 24. ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ Message-passing P1 fork P2 time time S1 Thread S1 P3 Shared address fork P4 join join S2 S2 Thread S2 S2 Shared address Process 0 Process 1 Node 1 Node 2
  • 25. DSM ์‹œ์Šคํ…œ์˜ ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ time S1 S1 S1 S1 P1 P2 P3 P4 Message-passing S2 S2 S2 S2 S2 S2 S2 S2 Process 0 Process 1 Process 2 Process 3 Node 1 Node 2
  • 26. SPMD์™€ MPMD (1/4) SPMD(Single Program Multiple Data) ๏‚ง ํ•˜๋‚˜์˜ ํ”„๋กœ๊ทธ๋žจ์ด ์—ฌ๋Ÿฌ ํ”„๋กœ์„ธ์Šค์—์„œ ๋™์‹œ์— ์ˆ˜ํ–‰๋จ ๏‚ง ์–ด๋–ค ์ˆœ๊ฐ„ ํ”„๋กœ์„ธ์Šค๋“ค์€ ๊ฐ™์€ ํ”„๋กœ๊ทธ๋žจ๋‚ด์˜ ๋ช…๋ น์–ด๋“ค์„ ์ˆ˜ํ–‰ํ•˜๋ฉฐ ๊ทธ ๋ช…๋ น์–ด๋“ค์€ ๊ฐ™์„ ์ˆ˜๋„ ๋‹ค๋ฅผ ์ˆ˜๋„ ์žˆ์Œ MPMD (Multiple Program Multiple Data) ๏‚ง ํ•œ MPMD ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์€ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์‹คํ–‰ ํ”„๋กœ๊ทธ๋žจ์œผ๋กœ ๊ตฌ์„ฑ ๏‚ง ์‘์šฉํ”„๋กœ๊ทธ๋žจ์ด ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰๋  ๋•Œ ๊ฐ ํ”„๋กœ์„ธ์Šค๋Š” ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์Šค์™€ ๊ฐ™๊ฑฐ๋‚˜ ๋‹ค๋ฅธ ํ”„๋กœ๊ทธ๋žจ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Œ
  • 28. SPMD์™€ MPMD (3/4) MPMD : Master/Worker (Self-Scheduling) a.out Node 1 b.out Node 2 Node 3
  • 29. SPMD์™€ MPMD (4/4) MPMD: Coupled Analysis a.out b.out c.out Node 1 Node 2 Node 3
  • 30. โ€ข์„ฑ๋Šฅ์ธก์ • โ€ข์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ์š”์ธ๋“ค โ€ข๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ์ž‘์„ฑ์ˆœ์„œ
  • 31. ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰์‹œ๊ฐ„ ์ธก์ • (1/2) time ์‚ฌ์šฉ๋ฐฉ๋ฒ•(bash, ksh) : $time [executable] $ time mpirun โ€“np 4 โ€“machinefile machines ./exmpi.x real 0m3.59s user 0m3.16s sys 0m0.04s ๏‚ง real = wall-clock time ๏‚ง User = ํ”„๋กœ๊ทธ๋žจ ์ž์‹ ๊ณผ ํ˜ธ์ถœ๋œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์‹คํ–‰์— ์‚ฌ์šฉ๋œ CPU ์‹œ๊ฐ„ ๏‚ง Sys = ํ”„๋กœ๊ทธ๋žจ์— ์˜ํ•ด ์‹œ์Šคํ…œ ํ˜ธ์ถœ์— ์‚ฌ์šฉ๋œ CPU ์‹œ๊ฐ„ ๏‚ง user + sys = CPU time
  • 32. ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰์‹œ๊ฐ„ ์ธก์ • (2/2) ์‚ฌ์šฉ๋ฐฉ๋ฒ•(csh) : $time [executable] $ time testprog 1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w โ‘  โ‘ก โ‘ข โ‘ฃ โ‘ค โ‘ฅ โ‘ฆ โ‘ง โ‘  user CPU time (1.15์ดˆ) โ‘ก system CPU time (0.02์ดˆ) โ‘ข real time (0๋ถ„ 1.76์ดˆ) โ‘ฃ real time์—์„œ CPU time์ด ์ฐจ์ง€ํ•˜๋Š” ์ •๋„(66.4%) โ‘ค ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ : Shared (15Kbytes) + Unshared (3981Kbytes) โ‘ฅ ์ž…๋ ฅ(24 ๋ธ”๋ก) + ์ถœ๋ ฅ(10 ๋ธ”๋ก) โ‘ฆ no page faults โ‘ง no swaps
  • 33. ์„ฑ๋Šฅ์ธก์ • ๋ณ‘๋ ฌํ™”๋ฅผ ํ†ตํ•ด ์–ป์–ด์ง„ ์„ฑ๋Šฅ์ด๋“์˜ ์ •๋Ÿ‰์  ๋ถ„์„ ์„ฑ๋Šฅ์ธก์ • ๏‚ง ์„ฑ๋Šฅํ–ฅ์ƒ๋„ ๏‚ง ํšจ์œจ ๏‚ง Cost
  • 34. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (1/7) ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (Speed-up) : S(n) S(n) = ์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰์‹œ๊ฐ„ = ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰์‹œ๊ฐ„(n๊ฐœ ํ”„๋กœ์„ธ์„œ) ts tp ๏‚ง ์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์— ๋Œ€ํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์˜ ์„ฑ๋Šฅ์ด๋“ ์ •๋„ ๏‚ง ์‹คํ–‰์‹œ๊ฐ„ = Wall-clock time ๏‚ง ์‹คํ–‰์‹œ๊ฐ„์ด 100์ดˆ๊ฐ€ ๊ฑธ๋ฆฌ๋Š” ์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ์„ ๋ณ‘๋ ฌํ™” ํ•˜์—ฌ 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 50์ดˆ ๋งŒ์— ์‹คํ–‰ ๋˜์—ˆ๋‹ค๋ฉด, ๏ƒจ S(10) = 100 = 50 2
  • 35. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (2/7) ์ด์ƒ(Ideal) ์„ฑ๋Šฅํ–ฅ์ƒ๋„ : Amdahlโ€Ÿs Law ๏‚ง f : ์ฝ”๋“œ์˜ ์ˆœ์ฐจ๋ถ€๋ถ„ (0 โ‰ค f โ‰ค 1) ๏‚ง tp = fts + (1-f)ts/n ์ˆœ์ฐจ๋ถ€๋ถ„ ์‹คํ–‰์‹œ ๊ฐ„ ๋ณ‘๋ ฌ๋ถ€๋ถ„ ์‹คํ–‰์‹œ ๊ฐ„
  • 36. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (3/7) ts (1 fts Serial section f )t S Parallelizable sections 1 2 n-1 n 1 2 n processes n-1 n tp (1 f )t S / n
  • 37. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (4/7) ๏‚ง S(n) = ts = tp ts fts + (1-f)ts/n 1 S(n) = ๏‚ง ์ตœ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ ( n ๏ƒ  โˆž ) S(n) = ๏‚ง f + (1-f)/n 1 f ํ”„๋กœ์„ธ์„œ์˜ ๊ฐœ์ˆ˜๋ฅผ ์ฆ๊ฐ€ํ•˜๋ฉด, ์ˆœ์ฐจ๋ถ€๋ถ„ ํฌ๊ธฐ์˜ ์—ญ์ˆ˜์— ์ˆ˜๋ ด
  • 38. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (5/7) f = 0.2, n = 4 Serial Parallel process 1 20 20 80 20 process 2 process 3 cannot be parallelized process 4 can be parallelized S(4) = 1 0.2 + (1-0.2)/4 = 2.5
  • 39. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (6/7) ํ”„๋กœ์„ธ์„œ ๊ฐœ์ˆ˜ ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ f=0 24 Speed-up 20 16 f=0.05 12 f=0.1 8 f=0.2 4 0 0 4 8 12 16 20 number of processors, n 24
  • 40. ์„ฑ๋Šฅํ–ฅ์ƒ๋„ (7/7) ์ˆœ์ฐจ๋ถ€๋ถ„ ๋Œ€ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ 16 14 Speed-up 12 n=256 10 8 6 n=16 4 2 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Serial fraction, f 0.8 0.9 1
  • 41. ํšจ์œจ ํšจ์œจ (Efficiency) : E(n) E(n) = ๏‚ง ts = tpโ…นn S(n) n ํ”„๋กœ์„ธ์„œ ๊ฐœ์ˆ˜์— ๋”ฐ๋ฅธ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์˜ ์„ฑ๋Šฅํšจ์œจ์„ ๋‚˜ํƒ€๋ƒ„ โ€ข 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 2๋ฐฐ์˜ ์„ฑ๋Šฅํ–ฅ์ƒ : โ€“ S(10) = 2 ๏ƒ  E(10) = 20 % โ€ข 100๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 10๋ฐฐ์˜ ์„ฑ๋Šฅํ–ฅ์ƒ : โ€“ S(100) = 10 ๏ƒ  E(100) = 10 %
  • 42. Cost Cost Cost = ์‹คํ–‰์‹œ๊ฐ„ โ…น ํ”„๋กœ์„ธ์„œ ๊ฐœ์ˆ˜ ๏‚ง ์ˆœ์ฐจ ํ”„๋กœ๊ทธ๋žจ : Cost = ts ๏‚ง ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ : Cost = tp โ…น n = tsn S(n) = ts E(n) ์˜ˆ) 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 2๋ฐฐ, 100๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ๋กœ 10๋ฐฐ์˜ ์„ฑ๋Šฅํ–ฅ์ƒ ts tp n S(n) E(n) Cost 100 50 10 2 0.2 500 100 10 100 10 0.1 1000
  • 43. ์‹ค์งˆ์  ์„ฑ๋Šฅํ–ฅ์ƒ์— ๊ณ ๋ คํ•  ์‚ฌํ•ญ ์‹ค์ œ ์„ฑ๋Šฅํ–ฅ์ƒ๋„ : ํ†ต์‹ ๋ถ€ํ•˜, ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ ๋ฌธ์ œ 20 80 Serial parallel 20 20 process 1 cannot be parallelized process 2 can be parallelized process 3 communication overhead process 4 Load unbalance
  • 44. ์„ฑ๋Šฅ์ฆ๊ฐ€๋ฅผ ์œ„ํ•œ ๋ฐฉ์•ˆ๋“ค 1. ํ”„๋กœ๊ทธ๋žจ์—์„œ ๋ณ‘๋ ฌํ™” ๊ฐ€๋Šฅํ•œ ๋ถ€๋ถ„(Coverage) ์ฆ๊ฐ€ ๏‚ง ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐœ์„  2. ์ž‘์—…๋ถ€ํ•˜์˜ ๊ท ๋“ฑ ๋ถ„๋ฐฐ : ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ 3. ํ†ต์‹ ์— ์†Œ๋น„ํ•˜๋Š” ์‹œ๊ฐ„(ํ†ต์‹ ๋ถ€ํ•˜) ๊ฐ์†Œ
  • 45. ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ์š”์ธ๋“ค Coverage : Amdahlโ€™s Law ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ ๋™๊ธฐํ™” ํ†ต์‹ ๋ถ€ํ•˜ ์„ธ๋ถ„์„ฑ ์ž…์ถœ๋ ฅ
  • 46. ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ ๋ชจ๋“  ํ”„๋กœ์„ธ์Šค๋“ค์˜ ์ž‘์—…์‹œ๊ฐ„์ด ๊ฐ€๋Šฅํ•œ ๊ท ๋“ฑํ•˜๋„๋ก ์ž‘์—…์„ ๋ถ„๋ฐฐํ•˜์—ฌ ์ž‘์—…๋Œ€๊ธฐ์‹œ๊ฐ„์„ ์ตœ์†Œํ™” ํ•˜๋Š” ๊ฒƒ ๏‚ง ๋ฐ์ดํ„ฐ ๋ถ„๋ฐฐ๋ฐฉ์‹(Block, Cyclic, Block-Cyclic) ์„ ํƒ์— ์ฃผ์˜ ๏‚ง ์ด๊ธฐ์ข… ์‹œ์Šคํ…œ์„ ์—ฐ๊ฒฐ์‹œํ‚จ ๊ฒฝ์šฐ, ๋งค์šฐ ์ค‘์š”ํ•จ ๏‚ง ๋™์  ์ž‘์—…ํ• ๋‹น์„ ํ†ตํ•ด ์–ป์„ ์ˆ˜๋„ ์žˆ์Œ task0 WORK task1 WAIT task2 task3 time
  • 47. ๋™๊ธฐํ™” ๋ณ‘๋ ฌ ํƒœ์Šคํฌ์˜ ์ƒํƒœ๋‚˜ ์ •๋ณด ๋“ฑ์„ ๋™์ผํ•˜๊ฒŒ ์„ค์ •ํ•˜๊ธฐ ์œ„ํ•œ ์กฐ์ •์ž‘์—… ๏‚ง ๏‚ง ๋Œ€ํ‘œ์  ๋ณ‘๋ ฌ๋ถ€ํ•˜ : ์„ฑ๋Šฅ์— ์•…์˜ํ–ฅ ์žฅ๋ฒฝ, ์ž ๊ธˆ, ์„ธ๋งˆํฌ์–ด(semaphore), ๋™๊ธฐํ†ต์‹  ์—ฐ์‚ฐ ๋“ฑ ์ด์šฉ ๋ณ‘๋ ฌ๋ถ€ํ•˜ (Parallel Overhead) ๏‚ง ๋ณ‘๋ ฌ ํƒœ์Šคํฌ์˜ ์‹œ์ž‘, ์ข…๋ฃŒ, ์กฐ์ •์œผ๋กœ ์ธํ•œ ๋ถ€ํ•˜ โ€ข ์‹œ์ž‘ : ํƒœ์Šคํฌ ์‹๋ณ„, ํ”„๋กœ์„ธ์„œ ์ง€์ •, ํƒœ์Šคํฌ ๋กœ๋“œ, ๋ฐ์ดํ„ฐ ๋กœ๋“œ ๋“ฑ โ€ข ์ข…๋ฃŒ : ๊ฒฐ๊ณผ์˜ ์ทจํ•ฉ๊ณผ ์ „์†ก, ์šด์˜์ฒด์ œ ์ž์›์˜ ๋ฐ˜๋‚ฉ ๋“ฑ โ€ข ์กฐ์ • : ๋™๊ธฐํ™”, ํ†ต์‹  ๋“ฑ
  • 48. ํ†ต์‹ ๋ถ€ํ•˜ (1/4) ๋ฐ์ดํ„ฐ ํ†ต์‹ ์— ์˜ํ•ด ๋ฐœ์ƒํ•˜๋Š” ๋ถ€ํ•˜ ๏‚ง ๋„คํŠธ์›Œํฌ ๊ณ ์œ ์˜ ์ง€์—ฐ์‹œ๊ฐ„๊ณผ ๋Œ€์—ญํญ ์กด์žฌ ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ์—์„œ ์ค‘์š” ํ†ต์‹ ๋ถ€ํ•˜์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ์š”์ธ๋“ค ๏‚ง ๋™๊ธฐํ†ต์‹ ? ๋น„๋™๊ธฐ ํ†ต์‹ ? ๏‚ง ๋ธ”๋กํ‚น? ๋…ผ๋ธ”๋กํ‚น? ๏‚ง ์ ๋Œ€์  ํ†ต์‹ ? ์ง‘ํ•ฉํ†ต์‹ ? ๏‚ง ๋ฐ์ดํ„ฐ์ „์†ก ํšŸ์ˆ˜, ์ „์†กํ•˜๋Š” ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ
  • 49. ํ†ต์‹ ๋ถ€ํ•˜ (2/4) ํ†ต์‹ ์‹œ๊ฐ„ = ์ง€์—ฐ์‹œ๊ฐ„ + ๏‚ง ๋ฉ”์‹œ์ง€ ํฌ๊ธฐ ๋Œ€์—ญํญ ์ง€์—ฐ์‹œ๊ฐ„ : ๋ฉ”์‹œ์ง€์˜ ์ฒซ ๋น„ํŠธ๊ฐ€ ์ „์†ก๋˜๋Š”๋ฐ ๊ฑธ๋ฆฌ๋Š” ์‹œ๊ฐ„ โ€ข ์†ก์‹ ์ง€์—ฐ + ์ˆ˜์‹ ์ง€์—ฐ + ์ „๋‹ฌ์ง€์—ฐ ๏‚ง ๋Œ€์—ญํญ : ๋‹จ์œ„์‹œ๊ฐ„๋‹น ํ†ต์‹  ๊ฐ€๋Šฅํ•œ ๋ฐ์ดํ„ฐ์˜ ์–‘(MB/sec) ์œ ํšจ ๋Œ€์—ญํญ = ๋ฉ”์‹œ์ง€ ํฌ๊ธฐ = ํ†ต์‹ ์‹œ๊ฐ„ ๋Œ€์—ญํญ 1+์ง€์—ฐ์‹œ๊ฐ„โ…น๋Œ€์—ญํญ/๋ฉ”์‹œ์ง€ํฌ๊ธฐ
  • 50. ํ†ต์‹ ๋ถ€ํ•˜ (3/4) Communication time Communication Time 1/slope = Bandwidth Latency message size
  • 51. ํ†ต์‹ ๋ถ€ํ•˜ (4/4) Effective Bandwidth effective bandwidth (MB/sec) 1000 network bandwidth 100 10 1 โ€ข latency = 22 ใŽฒ โ€ข bandwidth = 133 MB/sec 0.1 0.01 1 10 100 1000 10000 100000 1000000 message size(bytes)
  • 52. ์„ธ๋ถ„์„ฑ (1/2) ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ๋‚ด์˜ ํ†ต์‹ ์‹œ๊ฐ„์— ๋Œ€ํ•œ ๊ณ„์‚ฐ์‹œ๊ฐ„์˜ ๋น„ ๏‚ง Fine-grained ๋ณ‘๋ ฌ์„ฑ โ€ข ํ†ต์‹  ๋˜๋Š” ๋™๊ธฐํ™” ์‚ฌ์ด์˜ ๊ณ„์‚ฐ์ž‘์—…์ด ์ƒ๋Œ€์ ์œผ๋กœ ์ ์Œ โ€ข ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ์— ์œ ๋ฆฌ ๏‚ง Coarse-grained ๋ณ‘๋ ฌ์„ฑ โ€ข ํ†ต์‹  ๋˜๋Š” ๋™๊ธฐํ™” ์‚ฌ์ด์˜ ๊ณ„์‚ฐ์ž‘์—…์ด ์ƒ๋Œ€์ ์œผ๋กœ ๋งŽ์Œ โ€ข ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ์— ๋ถˆ๋ฆฌ ์ผ๋ฐ˜์ ์œผ๋กœ Coarse-grained ๋ณ‘๋ ฌ์„ฑ์ด ์„ฑ๋Šฅ๋ฉด์—์„œ ์œ ๋ฆฌ ๏‚ง ๊ณ„์‚ฐ์‹œ๊ฐ„ < ํ†ต์‹  ๋˜๋Š” ๋™๊ธฐํ™” ์‹œ๊ฐ„ ๏‚ง ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ํ•˜๋“œ์›จ์–ด ํ™˜๊ฒฝ์— ๋”ฐ๋ผ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ
  • 54. ์ž…์ถœ๋ ฅ ์ผ๋ฐ˜์ ์œผ๋กœ ๋ณ‘๋ ฌ์„ฑ์„ ๋ฐฉํ•ดํ•จ ๏‚ง ์“ฐ๊ธฐ : ๋™์ผ ํŒŒ์ผ๊ณต๊ฐ„์„ ์ด์šฉํ•  ๊ฒฝ์šฐ ๊ฒน์ณ ์“ฐ๊ธฐ ๋ฌธ์ œ ๏‚ง ์ฝ๊ธฐ : ๋‹ค์ค‘ ์ฝ๊ธฐ ์š”์ฒญ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ํŒŒ์ผ์„œ๋ฒ„์˜ ์„ฑ๋Šฅ ๋ฌธ์ œ ๏‚ง ๋„คํŠธ์›Œํฌ๋ฅผ ๊ฒฝ์œ (NFS, non-local)ํ•˜๋Š” ์ž…์ถœ๋ ฅ์˜ ๋ณ‘๋ชฉํ˜„์ƒ ์ž…์ถœ๋ ฅ์„ ๊ฐ€๋Šฅํ•˜๋ฉด ์ค„์ผ ๊ฒƒ ๏‚ง I/O ์ˆ˜ํ–‰์„ ํŠน์ • ์ˆœ์ฐจ์˜์—ญ์œผ๋กœ ์ œํ•œํ•ด ์‚ฌ์šฉ ๏‚ง ์ง€์—ญ์ ์ธ ํŒŒ์ผ๊ณต๊ฐ„์—์„œ I/O ์ˆ˜ํ–‰ ๋ณ‘๋ ฌ ํŒŒ์ผ์‹œ์Šคํ…œ์˜ ๊ฐœ๋ฐœ (GPFS, PVFS, PPFSโ€ฆ) ๋ณ‘๋ ฌ I/O ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ธํ„ฐํŽ˜์ด์Šค ๊ฐœ๋ฐœ (MPI-2 : MPI I/O)
  • 55. ํ™•์žฅ์„ฑ (1/2) ํ™•์žฅ๋œ ํ™˜๊ฒฝ์— ๋Œ€ํ•œ ์„ฑ๋Šฅ์ด๋“์„ ๋ˆ„๋ฆด ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ ๏‚ง ํ•˜๋“œ์›จ์–ด์  ํ™•์žฅ์„ฑ ๏‚ง ์•Œ๊ณ ๋ฆฌ์ฆ˜์  ํ™•์žฅ์„ฑ ํ™•์žฅ์„ฑ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ์ฃผ์š” ํ•˜๋“œ์›จ์–ด์  ์š”์ธ ๏‚ง CPU-๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์Šค ๋Œ€์—ญํญ ๏‚ง ๋„คํŠธ์›Œํฌ ๋Œ€์—ญํญ ๏‚ง ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰ ๏‚ง ํ”„๋กœ์„ธ์„œ ํด๋Ÿญ ์†๋„
  • 57. ์˜์กด์„ฑ๊ณผ ๊ต์ฐฉ ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ : ํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰ ์ˆœ์„œ๊ฐ€ ์‹คํ–‰ ๊ฒฐ๊ณผ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๊ฒƒ DO k = 1, 100 F(k + 2) = F(k +1) + F(k) ENDDO ๊ต์ฐฉ : ๋‘˜ ์ด์ƒ์˜ ํ”„๋กœ์„ธ์Šค๋“ค์ด ์„œ๋กœ ์ƒ๋Œ€๋ฐฉ์˜ ์ด๋ฒคํŠธ ๋ฐœ์ƒ์„ ๊ธฐ๋‹ค๋ฆฌ๋Š” ์ƒํƒœ Process 1 X = 4 SOURCE = TASK2 RECEIVE (SOURCE,Y) DEST = TASK2 SEND (DEST,X) Z = X + Y Process 2 Y = 8 SOURCE = TASK1 RECEIVE (SOURCE,X) DEST = TASK1 SEND (DEST,Y) Z = X + Y
  • 58. ์˜์กด์„ฑ F(1) F(2) F(3) F(4) F(5) F(6) F(7) โ€ฆ F(n) 1 2 3 4 5 6 7 โ€ฆ n DO k = 1, 100 F(k + 2) = F(k +1) + F(k) ENDDO Serial F(1) F(2) F(3) F(4) F(5) F(6) F(7) โ€ฆ F(n) 1 2 3 5 8 13 21 โ€ฆ โ€ฆ F(1) F(2) F(3) F(4) F(5) F(6) F(7) โ€ฆ F(n) 1 2 3 5(4) 7 11 18 โ€ฆ โ€ฆ Parallel
  • 59. ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ์ž‘์„ฑ ์ˆœ์„œ โ‘  ์ˆœ์ฐจ์ฝ”๋“œ ์ž‘์„ฑ, ๋ถ„์„(ํ”„๋กœํŒŒ์ผ๋ง), ์ตœ์ ํ™” ๏‚ง ๏‚ง โ‘ก hotspot, ๋ณ‘๋ชฉ์ง€์ , ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ ๋“ฑ์„ ํ™•์ธ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ/ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ ? ๋ณ‘๋ ฌ์ฝ”๋“œ ๊ฐœ๋ฐœ ๏‚ง MPI/OpenMP/โ€ฆ ? ๏‚ง ํƒœ์Šคํฌ ํ• ๋‹น๊ณผ ์ œ์–ด, ํ†ต์‹ , ๋™๊ธฐํ™” ์ฝ”๋“œ ์ถ”๊ฐ€ โ‘ข ์ปดํŒŒ์ผ, ์‹คํ–‰, ๋””๋ฒ„๊น… โ‘ฃ ๋ณ‘๋ ฌ์ฝ”๋“œ ์ตœ์ ํ™” ๏‚ง ์„ฑ๋Šฅ์ธก์ •๊ณผ ๋ถ„์„์„ ํ†ตํ•œ ์„ฑ๋Šฅ๊ฐœ์„ 
  • 60. ๋””๋ฒ„๊น…๊ณผ ์„ฑ๋Šฅ๋ถ„์„ ๋””๋ฒ„๊น… ๏‚ง ์ฝ”๋“œ ์ž‘์„ฑ์‹œ ๋ชจ๋“ˆํ™” ์ ‘๊ทผ ํ•„์š” ๏‚ง ํ†ต์‹ , ๋™๊ธฐํ™”, ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ, ๊ต์ฐฉ ๋“ฑ์— ์ฃผ์˜ ๏‚ง ๋””๋ฒ„๊ฑฐ : TotalView ์„ฑ๋Šฅ์ธก์ •๊ณผ ๋ถ„์„ ๏‚ง timer ํ•จ์ˆ˜ ์‚ฌ์šฉ ๏‚ง ํ”„๋กœํŒŒ์ผ๋Ÿฌ : prof, gprof, pgprof, TAU
  • 62. I. Introduction to Parallel Computing
  • 63. OpenMP๋ž€ ๋ฌด์—‡์ธ๊ฐ€? ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ํ™˜๊ฒฝ์—์„œ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ์ž‘์„ฑ์„ ์œ„ํ•œ ์‘์šฉํ”„๋กœ๊ทธ๋žจ ์ธํ„ฐํŽ˜์ด์Šค(API)
  • 64. OpenMP์˜ ์—ญ์‚ฌ 1990๋…„๋Œ€ : ๏‚ง ๊ณ ์„ฑ๋Šฅ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์˜ ๋ฐœ์ „ ๏‚ง ์—…์ฒด ๊ณ ์œ ์˜ ์ง€์‹œ์–ด ์ง‘ํ•ฉ ์‚ฌ์šฉ ๏ƒ  ํ‘œ์ค€ํ™”์˜ ํ•„์š”์„ฑ 1994๋…„ ANSI X3H5 ๏ƒ  1996๋…„ openmp.org ์„ค๋ฆฝ 1997๋…„ OpenMP API ๋ฐœํ‘œ Release History ๏‚ง OpenMP Fortran API ๋ฒ„์ „ 1.0 : 1997๋…„ 10์›” ๏‚ง C/C++ API ๋ฒ„์ „ 1.0 : 1998๋…„ 10์›” ๏‚ง Fortran API ๋ฒ„์ „ 1.1 : 1999๋…„ 11์›” ๏‚ง Fortran API ๋ฒ„์ „ 2.0 : 2000๋…„ 11์›” ๏‚ง C/C++ API ๋ฒ„์ „ 2.0 : 2002๋…„ 3์›” ๏‚ง Combined C/C++ and Fortran API ๋ฒ„์ „ 2.5 : 2005๋…„ 5์›” ๏‚ง API ๋ฒ„์ „ 3.0 : 2008๋…„ 5์›”
  • 65. OpenMP์˜ ๋ชฉํ‘œ ํ‘œ์ค€๊ณผ ์ด์‹์„ฑ ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ‘œ์ค€ ๋Œ€๋ถ€๋ถ„์˜ Unix์™€ Windows์— OpenMP ์ปดํŒŒ์ผ๋Ÿฌ ์กด์žฌ Fortran, C/C++ ์ง€์›
  • 67. OpenMP์˜ ๊ตฌ์„ฑ (2/2) ์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ๏‚ง ์Šค๋ ˆ๋“œ ์‚ฌ์ด์˜ ์ž‘์—…๋ถ„๋‹ด, ํ†ต์‹ , ๋™๊ธฐํ™”๋ฅผ ๋‹ด๋‹น ๏‚ง ์ข์€ ์˜๋ฏธ์˜ OpenMP ์˜ˆ) C$OMP PARALLEL DO ์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๏‚ง ๋ณ‘๋ ฌ ๋งค๊ฐœ๋ณ€์ˆ˜(์ฐธ์—ฌ ์Šค๋ ˆ๋“œ์˜ ๊ฐœ์ˆ˜, ๋ฒˆํ˜ธ ๋“ฑ)์˜ ์„ค์ •๊ณผ ์กฐํšŒ ์˜ˆ) CALL omp_set_num_threads(128) ํ™˜๊ฒฝ๋ณ€์ˆ˜ ๏‚ง ์‹คํ–‰ ์‹œ์Šคํ…œ์˜ ๋ณ‘๋ ฌ ๋งค๊ฐœ๋ณ€์ˆ˜(์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ๋“ฑ)๋ฅผ ์ •์˜ ์˜ˆ) export OMP_NUM_THREADS=8
  • 68. OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (1/4) ์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ๊ธฐ๋ฐ˜ ๏‚ง ์ˆœ์ฐจ์ฝ”๋“œ์˜ ์ ์ ˆํ•œ ์œ„์น˜์— ์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ์‚ฝ์ž… ๏‚ง ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ์ง€์‹œ์–ด๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์ฝ”๋“œ ์ƒ์„ฑ ๏‚ง OpenMP๋ฅผ ์ง€์›ํ•˜๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ํ•„์š” ๏‚ง ๋™๊ธฐํ™”, ์˜์กด์„ฑ ์ œ๊ฑฐ ๋“ฑ์˜ ์ž‘์—… ํ•„์š”
  • 69. OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (2/4) Fork-Join ๏‚ง ๏‚ง ๋ณ‘๋ ฌํ™”๊ฐ€ ํ•„์š”ํ•œ ๋ถ€๋ถ„์— ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์ƒ์„ฑ ๋ณ‘๋ ฌ๊ณ„์‚ฐ์„ ๋งˆ์น˜๋ฉด ๋‹ค์‹œ ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰ F J F J O O O O Master R I R I Thread K N K N [Parallel Region] [Parallel Region]
  • 70. OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (3/4) ์ปดํŒŒ์ผ๋Ÿฌ ์ง€์‹œ์–ด ์‚ฝ์ž… Serial Code PROGRAM exam โ€ฆ ialpha = 2 DO i = 1, 100 a(i) = a(i) + ialpha*b(i) ENDDO PRINT *, a END Parallel Code PROGRAM exam โ€ฆ ialpha = 2 !$OMP PARALLEL DO DO i = 1, 100 a(i) = a(i) + ialpha*b(i) ENDDO !$OMP END PARALLEL DO PRINT *, a END
  • 71. OpenMP ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ (4/4) Fork-Join โ€ป export OMP_NUM_THREADS = 4 ialpha = 2 (Master Thread) (Fork) DO i=1,25 DO i=26,50 DO i=51,75 DO i=76,100 ... ... ... ... (Join) (Master) PRINT *, a (Slave) (Master Thread) (Slave) (Slave)
  • 72. OpenMP์˜ ์žฅ์ ๊ณผ ๋‹จ์  ์žฅ ์  ๏‚– MPI๋ณด๋‹ค ์ฝ”๋”ฉ, ๋””๋ฒ„๊น…์ด ์‰ฌ์›€ ๏‚– ๋ฐ์ดํ„ฐ ๋ถ„๋ฐฐ๊ฐ€ ์ˆ˜์›” ๋‹จ ์  โ€ข ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌํ™˜๊ฒฝ์˜ ๋‹ค์ค‘ ํ”„๋กœ์„ธ์„œ ์•„ํ‚คํ…์ฒ˜์—์„œ๋งŒ ๊ตฌํ˜„ ๊ฐ€๋Šฅ ๏‚– ์ ์ง„์  ๋ณ‘๋ ฌํ™”๊ฐ€ ๊ฐ€๋Šฅ โ€ข OpenMP๋ฅผ ์ง€์›ํ•˜๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ํ•„์š” ๏‚– ํ•˜๋‚˜์˜ ์ฝ”๋“œ๋ฅผ ๋ณ‘๋ ฌ์ฝ”๋“œ์™€ ์ˆœ์ฐจ์ฝ” โ€ข ๋ฃจํ”„์— ๋Œ€ํ•œ ์˜์กด๋„๊ฐ€ ํผ ๏ƒ  ๋‚ฎ์€ ๋“œ๋กœ ์ปดํŒŒ์ผ ๊ฐ€๋Šฅ ๏‚– ์ƒ๋Œ€์ ์œผ๋กœ ์ฝ”๋“œ ํฌ๊ธฐ๊ฐ€ ์ž‘์Œ ๋ณ‘๋ ฌํ™” ํšจ์œจ์„ฑ โ€ข ๊ณต์œ ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์˜ ํ™•์žฅ์„ฑ (ํ”„๋กœ์„ธ์„œ ์ˆ˜, ๋ฉ”๋ชจ๋ฆฌ ๋“ฑ) ํ•œ๊ณ„
  • 73. OpenMP์˜ ์ „ํ˜•์  ์‚ฌ์šฉ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ์„ ์ด์šฉํ•œ ๋ฃจํ”„์˜ ๋ณ‘๋ ฌํ™” 1. ์‹œ๊ฐ„์ด ๋งŽ์ด ๊ฑธ๋ฆฌ๋Š” ๋ฃจํ”„๋ฅผ ์ฐพ์Œ (ํ”„๋กœํŒŒ์ผ๋ง) 2. ์˜์กด์„ฑ, ๋ฐ์ดํ„ฐ ์œ ํšจ๋ฒ”์œ„ ์กฐ์‚ฌ 3. ์ง€์‹œ์–ด ์‚ฝ์ž…์œผ๋กœ ๋ณ‘๋ ฌํ™” ํƒœ์Šคํฌ ๋ณ‘๋ ฌ์„ฑ์„ ์ด์šฉํ•œ ๋ณ‘๋ ฌํ™”๋„ ๊ฐ€๋Šฅ
  • 74. ์ง€์‹œ์–ด (1/5) OpenMP ์ง€์‹œ์–ด ๋ฌธ๋ฒ• Fortran (๊ณ ์ •ํ˜•์‹:f77) ์ง€์‹œ์–ด ์‹œ์ž‘ (๊ฐ์‹œ๋ฌธ์ž) ์ค„ ๋ฐ”๊ฟˆ ์„ ํƒ์  ์ปดํŒŒ์ผ ์‹œ์ž‘์œ„์น˜ Fortran (์ž์œ ํ˜•์‹:f90) C โ–ช !$OMP <์ง€์‹œ์–ด> โ–ช C$OMP <์ง€์‹œ์–ด> โ–ช !$OMP <์ง€์‹œ์–ด> โ–ช #pragma omp โ–ช !$OMP <์ง€์‹œ์–ด> & โ–ช #pragma omp โ€ฆ โ–ช *$OMP <์ง€์‹œ์–ด> โ–ช !$OMP <์ง€์‹œ์–ด> !$OMP& โ€ฆ โ€ฆ โ€ฆ โ–ช !$ โ€ฆ โ–ช C$ โ€ฆ โ–ช !$ โ€ฆ โ–ช #ifdef _OPENMP โ–ช *$ โ€ฆ ์ฒซ๋ฒˆ์งธ ์—ด ๋ฌด๊ด€ ๋ฌด๊ด€
  • 75. ์ง€์‹œ์–ด (2/5) ๋ณ‘๋ ฌ์˜์—ญ ์ง€์‹œ์–ด ๏‚ง ๏‚ง ๏‚ง PARALLEL/END PARALLEL ์ฝ”๋“œ๋ถ€๋ถ„์„ ๋ณ‘๋ ฌ์˜์—ญ์œผ๋กœ ์ง€์ • ์ง€์ •๋œ ์˜์—ญ์€ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ์—์„œ ๋™์‹œ์— ์‹คํ–‰๋จ ์ž‘์—…๋ถ„ํ•  ์ง€์‹œ์–ด ๏‚ง ๏‚ง ๏‚ง DO/FOR ๋ณ‘๋ ฌ์˜์—ญ ๋‚ด์—์„œ ์‚ฌ์šฉ ๋ฃจํ”„์ธ๋ฑ์Šค๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๊ฐ ์Šค๋ ˆ๋“œ์—๊ฒŒ ๋ฃจํ”„์ž‘์—… ํ• ๋‹น ๊ฒฐํ•ฉ๋œ ๋ณ‘๋ ฌ ์ž‘์—…๋ถ„ํ•  ์ง€์‹œ์–ด ๏‚ง ๏‚ง PARALLEL DO/FOR PARALLEL + DO/FOR์˜ ์—ญํ• ์„ ์ˆ˜ํ–‰
  • 76. ์ง€์‹œ์–ด (3/5) ๋ณ‘๋ ฌ์˜์—ญ ์ง€์ • Fortran !$OMP PARALLEL DO i = 1, 10 PRINT *, โ€žHello Worldโ€Ÿ, i ENDDO !$OMP END PARALLEL C #pragma omp parallel for(i=1; i<=10; i++) printf(โ€œHello World %dnโ€,i);
  • 77. ์ง€์‹œ์–ด (4/5) ๋ณ‘๋ ฌ์˜์—ญ๊ณผ ์ž‘์—…๋ถ„ํ•  Fortran C !$OMP PARALLEL #pragma omp parallel !$OMP DO DO i = 1, 10 PRINT *, โ€žHello Worldโ€Ÿ, i ENDDO [!$OMP END DO] !$OMP END PARALLEL { #pragma omp for for(i=1; i<=10; i++) printf(โ€œHello World %dnโ€,i); }
  • 78. ์ง€์‹œ์–ด (5/5) ๋ณ‘๋ ฌ์˜์—ญ๊ณผ ์ž‘์—…๋ถ„ํ•  Fortran !$OMP PARALLEL !$OMP DO DO i = 1, n a(i) = b(i) + c(i) ENDDO [!$OMP END DO] Optional !$OMP DO โ€ฆ [!$OMP END DO] !$OMP END PARALLEL C #pragma omp parallel { #pragma omp for for (i=1; i<=n; i++) { a[i] = b[i] + c[i] } #pragma omp for for(โ€ฆ){ โ€ฆ } }
  • 79. ์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ ํ™˜๊ฒฝ๋ณ€์ˆ˜ (1/3) ์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๏‚ง ๏‚ง ๏‚ง omp_set_num_threads(integer) : ์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ์ง€์ • omp_get_num_threads() : ์Šค๋ ˆ๋“œ ๊ฐœ์ˆ˜ ๋ฆฌํ„ด omp_get_thread_num() : ์Šค๋ ˆ๋“œ ID ๋ฆฌํ„ด ํ™˜๊ฒฝ๋ณ€์ˆ˜ ๏‚ง OMP_NUM_THREADS : ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์Šค๋ ˆ๋“œ ์ตœ๋Œ€ ๊ฐœ์ˆ˜ โ€ข export OMP_NUM_THREADS=16 (ksh) โ€ข setenv OMP_NUM_THREADS 16 (csh) C : #include <omp.h>
  • 80. ์‹คํ–‰์‹œ๊ฐ„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ ํ™˜๊ฒฝ๋ณ€์ˆ˜ (3/3) omp_set_num_threads omp_get_thread_num INTEGER OMP_GET_THREAD_NUM CALL OMP_SET_NUM_THREADS(4) Fortran !$OMP PARALLEL PRINT*, โ€ฒThread rank: โ€ฒ, OMP_GET_THREAD_NUM() !$OMP END PARALLEL #include <omp.h> omp_set_num_threads(4); C #pragma omp parallel { printf(โ€ณThread rank:%d๏ผผnโ€ณ,omp_get_thread_num()); }
  • 81. ์ฃผ์š” Clauses private(var1, var2, โ€ฆ) shared(var1, var2, โ€ฆ) default(shared|private|none) firstprivate(var1, var2, โ€ฆ) lastprivate(var1, var2, โ€ฆ) reduction(operator|intrinsic:var1, var2,โ€ฆ) schedule(type [,chunk])
  • 82. clause : reduction (1/4) reduction(operator|intrinsic:var1, var2,โ€ฆ) ๏‚ง reduction ๋ณ€์ˆ˜๋Š” shared โ€ข ๋ฐฐ์—ด ๊ฐ€๋Šฅ(Fortran only): deferred shape, assumed shape array ์‚ฌ ์šฉ ๋ถˆ๊ฐ€ โ€ข C๋Š” scalar ๋ณ€์ˆ˜๋งŒ ๊ฐ€๋Šฅ ๏‚ง ๊ฐ ์Šค๋ ˆ๋“œ์— ๋ณต์ œ๋ผ ์—ฐ์‚ฐ์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๊ฐ’์œผ๋กœ ์ดˆ๊ธฐํ™”๋˜๊ณ (ํ‘œ ์ฐธ์กฐ) ๋ณ‘๋ ฌ ์—ฐ์‚ฐ ์ˆ˜ํ–‰ ๏‚ง ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ์—์„œ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰๋œ ๊ณ„์‚ฐ๊ฒฐ๊ณผ๋ฅผ ํ™˜์‚ฐํ•ด ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๋งˆ์Šคํ„ฐ ์Šค๋ ˆ๋“œ๋กœ ๋‚ด ๋†“ ์Œ
  • 83. clause : reduction (2/4) !$OMP DO reduction(+:sum) DO i = 1, 100 sum = sum + x(i) ENDDO Thread 0 Thread 1 sum0 = 0 sum1 = 0 DO i = 1, 50 DO i = 51, 100 sum0 = sum0 + x(i) ENDDO sum = sum0 + sum1 sum1 = sum1 + x(i) ENDDO
  • 84. clause : reduction (3/4) Reduction Operators : Fortran Operator Data Types ์ดˆ๊ธฐ๊ฐ’ + integer, floating point (complex or real) 0 * integer, floating point (complex or real) 1 - integer, floating point (complex or real) 0 .AND. logical .TRUE. .OR. logical .FALSE. .EQV. logical .TRUE. .NEQV. logical .FALSE. MAX integer, floating point (real only) ๊ฐ€๋Šฅํ•œ ์ตœ์†Œ๊ฐ’ MIN integer, floating point (real only) ๊ฐ€๋Šฅํ•œ ์ตœ๋Œ€๊ฐ’ IAND integer all bits on IOR integer 0 IEOR integer 0
  • 85. clause : reduction (4/4) Reduction Operators : C Operator Data Types ์ดˆ๊ธฐ๊ฐ’ + integer, floating point 0 * integer, floating point 1 - integer, floating point 0 & integer all bits on | integer 0 ^ integer 0 && integer 1 || integer 0
  • 88. Current HPC Platforms : COTS-Based Clusters COTS = Commercial off-the-shelf Nehalem Access Control File Server(s) Gulftown โ€ฆ Login Node(s) 88 Compute Nodes
  • 89. Memory Architectures Shared Memory ๏‚ง Single address space for all processors <NUMA> <UMA> Distributed Memory 89
  • 90. What is MPI? MPI = Message Passing Interface MPI is a specification for the developers and users of message passing libraries. By itself, it is NOT a library โ€“ but rather the specification of what such a library should be. MPI primarily addresses the message-passing parallel programming model : data is moved from the address space of one process to that of another process through cooperative operations on each process. Simply stated, the goal of the message Passing Interface is to provide a widely used standard for writing message passing programs. The interface attempts to be : ๏‚ง ๏‚ง Portable ๏‚ง Efficient ๏‚ง 90 Practical Flexible
  • 91. What is MPI? The MPI standard has gone through a number of revisions, with the most recent version being MPI-3. Interface specifications have been defined for C and Fortran90 language bindings : ๏‚ง C++ bindings from MPI-1 are removed in MPI-3 ๏‚ง MPI-3 also provides support for Fortran 2003 and 2008 features Actual MPI library implementations differ in which version and features of the MPI standard they support. Developers/users will need to be aware of this. 91
  • 92. Programming Model Originally, MPI was designed for distributed memory architectures, which were becoming increasingly popular at time (1980s โ€“ early 1990s). As architecture trends changed, shared memory SMPs were combined over networks creating hybrid distributed memory/shared memory systems. 92
  • 93. Programming Model MPI implementers adapted their libraries to handle both types of underlying memory architectures seamlessly. They also adapted/developed ways of handing different interconnects and protocols. Today, MPI runs on virtually any hardware platform : ๏‚ง Distributed Memory ๏‚ง Shared Memory ๏‚ง Hybrid The programming model clearly remains a distributed memory model however, regardless of the underlying physical architecture of the machine. 93
  • 94. Reasons for Using MPI Standardization ๏‚ง MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries. Portability ๏‚ง There is little or no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard. Performance Opportunities ๏‚ง Vendor implementations should be able to exploit native hardware features to optimize performance. Functionality ๏‚ง There are over 440 routines defined in MPI-3, which includes the majority of those in MPI-2 and MPI-1. Availability ๏‚ง 94 A Variety of implementations are available, both vendor and public domain.
  • 95. History and Evolution MPI has resulted from the efforts of numerous individuals and groups that began in 1992. 1980s โ€“ early 1990s : Distributed memory, parallel computing develops, as do a number of incompatible soft ware tools for writing such programs โ€“ usually with tradeoffs between portability, performance, functionality and price. Recognition of the need for a standard arose. Apr 1992 : Workshop on Standards for Message Passing in a Distributed Memory Environment, Sponsored by the Center for Research on Parallel Computing, Williamsburg, Virginia. The basic features essential to a standard message passing interface were discussed, and a working group established to continue the standardization process. Preliminary draft proposal developed subsequently. 95
  • 96. History and Evolution Nov 1992 : Working group meets in Minneapolis. MPI draft proposal (MPI1) from ORNL presented. Group adopts procedures and organization to form the MPI Forum. It eventually comprised of about 175 individuals from 40 organizations including parallel computer vendors, software writers, academia and application scientists. Nov 1993 : Supercomputing 93 conference โ€“ draft MPI standard presented. May 1994 : Final version of MPI-1.0 released. MPI-1.0 was followed by versions MPI-1.1 (Jun 1995), MPI-1.2 (Jul 1997) and MPI-1.3 (May 2008). MPI-2 picked up where the first MPI specification left off, and addressed topics which went far beyond the MPI-1 specification. Was finalized in 1996. MPI-2.1 (Sep 2009), and MPI-2.2 (Sep 2009) followed. Sep 2012 : The MPI-3.0 standard was approved. 96
  • 97. History and Evolution Documentation for all versions of the MPI standard is available at : ๏‚ง 97 http://www.mpi-forum.org/docs/
  • 98. A General Structure of the MPI Program 98
  • 99. A Header File for MPI routines Required for all programs that make MPI library calls. C include file Fortran include file #include โ€œmpi.hโ€ include โ€žmpif.hโ€Ÿ With MPI-3 Fortran, the USE mpi_f80 module is preferred over using the include file shown above. 99
  • 100. The Format of MPI Calls C names are case sensitive; Fortran name are not. Programs must not declare variables or functions with names beginning with the prefix MPI_ or PMPI_ (profiling interface). C Binding Format rc = MPI_Xxxxx(parameter, โ€ฆ) Example rc = MPI_Bsend(&buf, count, type, dest, tag, comm) Error code Returned as โ€œrcโ€, MPI_SUCCESS if successful. Fortran Binding Format Example call MPI_BSEND(buf, count, type, dest, tag, comm, ierr) Error code 100 CALL MPI_XXXXX(parameter, โ€ฆ, ierr) call mpi_xxxxx(parameter, โ€ฆ, ierr) Returned as โ€œierrโ€ parameter, MPI_SUCCESS if successful.
  • 101. Communicators and Groups MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Most MPI routines require you to specify a communicator as an argument. Communicators and groups will be covered in more detail later. For now, simply use MPI_COMM_WORLD whenever a communicator is required - it is the predefined communicator that includes all of your MPI processes. 101
  • 102. Rank Within a communicator, every process has its own unique, integer identifier assigned by the system when the process initializes. A rank is sometimes also called a โ€œtask IDโ€. Ranks are contiguous and begin at zero. Used by the programmer to specify the source and destination of messages. Often used conditionally by the application to control program execution (if rank = 0 do this / if rank = 1 do that). 102
  • 103. Error Handling Most MPI routines include a return/error code parameter, as described in โ€œFormat of MPI Callsโ€ section above. However, according to the MPI standard, the default behavior of an MPI call is to abort if there is an error. This means you will probably not be able to capture a return/error code other than MPI_SUCCESS (zero). The standard does provide a means to override this default error handler. You can also consult the error handing section of the MPI Standard located at http://www.mpiforum.org/docs/mpi-11-html/node148.html . The types of errors displayed to the user are implementation dependent. 103
  • 104. Environment Management Routines MPI_Init ๏‚ง Initializes the MPI execution environment. This function must be called is every MPI program, must be called before any other MPI functions and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent. C MPI_Init(&argc, &argv) ๏‚ง ๏‚ง 104 Fortran MPI_INIT(ierr) Input parameters โ€ข argc : Pointer to the number of arguments โ€ข argv : Pointer to the argument vector ierr : the error return argument
  • 105. Environment Management Routines MPI_Comm_size ๏‚ง Returns the total number of MPI processes in the specified communicator, such as MPI_COMM_WORLD. If the communicator is MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application. C MPI_Comm_size(comm, &size) ๏‚ง ๏‚ง ๏‚ง 105 Fortran MPI_COMM_SIZE(comm, size, ierr) Input parameters โ€ข comm : communicator (handle) Output parameters โ€ข size : number of processes in the group of comm (integer) ierr : the error return argument
  • 106. Environment Management Routines MPI_Comm_rank ๏‚ง Returns the rank of the calling MPI process within the specified communicator. Initially, each process will be assigned a unique integer rank between 0 and number of tasks -1 within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well. C MPI_Comm_rank(comm, &rank) ๏‚ง ๏‚ง ๏‚ง 106 Fortran MPI_COMM_SIZE(comm, rank, ierr) Input parameters โ€ข comm : communicator (handle) Output parameters โ€ข rank : rank of the calling process in the group of comm (integer) ierr : the error return argument
  • 107. Environment Management Routines MPI_Finalize ๏‚ง Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program โ€“ no other MPI routines may be called after it. C MPI_Finalize() ๏‚ง 107 ierr : the error return argument Fortran MPI_FINALIZE(ierr)
  • 108. Environment Management Routines MPI_Abort ๏‚ง Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes regardless of the communicator specified. C MPI_Abort(comm, errorcode) ๏‚ง ๏‚ง 108 Fortran MPI_ABORT(comm, errorcode, ierr) Input parameters โ€ข comm : communicator (handle) โ€ข errorcode : error code to return to invoking environment ierr : the error return argument
  • 109. Environment Management Routines MPI_Get_processor_name ๏‚ง Return the processor name. Also returns the length of the name. The buffer for โ€œnameโ€ must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into โ€œnameโ€ is implementation dependent โ€“ may not be the same as the output of the โ€œhostnameโ€ or โ€œhostโ€ shell commands. C Fortran MPI_Get_processor_name(&name, &resultlength) MPI_GET_PROCESSOR_NAME(n ame, resultlength, ierr) ๏‚ง ๏‚ง 109 Output parameters โ€ข name : A unique specifies for the actual (as opposed to virtual) node. This must be an array of size at least MPI_MAX_PROCESOR_NAME . โ€ข resultlen : Length (in characters) of the name. ierr : the error return argument
  • 110. Environment Management Routines MPI_Get_version ๏‚ง Returns the version (either 1 or 2) and subversion of MPI. C MPI_Get_version(&version, &subversion) ๏‚ง ๏‚ง 110 Fortran MPI_GET_VERSION(version, subversion, ierr) Output parameters โ€ข version : Major version of MPI (1 or 2) โ€ข subversion : Miner version of MPI. ierr : the error return argument
  • 111. Environment Management Routines MPI_Initialized ๏‚ง Indicates whether MPI_Init has been called โ€“ returns flag as either logical true(1) or false(0). C MPI_Initialized(&flag) ๏‚ง ๏‚ง 111 Fortran MPI_INITIALIZED(flag, ierr) Output parameters โ€ข flag : Flag is true if MPI_Init has been called and false otherwise. ierr : the error return argument
  • 112. Environment Management Routines MPI_Wtime ๏‚ง Returns an elapsed wall clock time in seconds (double precision) on the calling processor. C MPI_Wtime() ๏‚ง Fortran MPI_WTIME() Return value โ€ข Time in seconds since an arbitrary time in the past. MPI_Wtick ๏‚ง Returns the resolution in seconds (double precision) of MPI_Wtime. C MPI_Wtick() ๏‚ง 112 Fortran MPI_WTICK() Return value โ€ข Time in seconds of the resolution MPI_Wtime.
  • 113. Example: Hello world #include<stdio.h> #include"mpi.h" int main(int argc, char *argv[]) { int rc; rc = MPI_Init(&argc, &argv); printf("Hello world.n"); rc = MPI_Finalize(); return 0; } 113
  • 114. Example: Hello world Execute a mpi program. $ module load [compiler] [mpi] $ mpicc hello.c $ mpirun โ€“np 4 โ€“hostfile [hostfile] ./a.out Make out a hostfile. ibs0001 ibs0002 ibs0003 ibs0003 โ€ฆ 114 slots=2 slots=2 slots=2 slots=2
  • 115. Example : Environment Management Routine #include "mpi.hโ€ #include <stdio.h> int main(argc,argv) int argc; char *argv[]; { int numtasks, rank, len, rc; char hostname[MPI_MAX_PROCESSOR_NAME]; rc = MPI_Init(&argc,&argv); if (rc != MPI_SUCCESS) { printf ("Error starting MPI program. Terminating.n"); MPI_Abort(MPI_COMM_WORLD, rc); } MPI_Comm_size(MPI_COMM_WORLD,&numtasks); MPI_Comm_rank(MPI_COMM_WORLD,&rank); MPI_Get_processor_name(hostname, &len); printf ("Number of tasks= %d My rank= %d Running on %sn", numtasks,rank,hostname); /******* do some work *******/ rc = MPI_Finalize(); return 0; } 115
  • 116. Types of Point-to-Point Operations MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks. One task is performing a send operation and the other task is performing a matching receive operation. There are different types of send and receive routines used for different purposes. ๏‚ง Synchronous send ๏‚ง Blocking send/blocking receive ๏‚ง Non-blocking send/non-blocking receive ๏‚ง Buffered send ๏‚ง Combined send/receive ๏‚ง โ€œReadyโ€ send Any type of send routine can be paired with any type of receive routine. MPI also provides several routines associated with send โ€“ receive operations, such as those used to wait for a messageโ€™s arrival or prove to find out if a message has arrived. 116
  • 117. Buffering In a perfect world, every send operation would be perfectly synchronized with its matching re ceive. This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync. Consider the following two cases ๏‚ง ๏‚ง 117 A send operation occurs 5 seconds before the receive is ready โ€“ where is the message w hile the receive is pending? Multiple sends arrive at the same receiving task which can only accept one send at a tim e โ€“ what happens to the messages that are โ€œbacking upโ€?
  • 118. Buffering The MPI implementation (not the MPI standard) decides what happens to data in these types of cases. Typically, a system buffer area is reserved to hold data in transit. 118
  • 119. Buffering System buffer space is : ๏‚ง ๏‚ง ๏‚ง ๏‚ง ๏‚ง 119 Opaque to the programmer and managed entirely by the MPI library A finite resource that can be easy to exhaust Often mysterious and not well documented Able to exist on the sending side, the receiving side, or both Something that may improve program performance because it allows send โ€“ receive ope rations to be asynchronous.
  • 120. Blocking vs. Non-blocking Most of the MPI point-to-point routines can be used in either blocking or non-blocking mode. Blocking ๏‚ง ๏‚ง ๏‚ง ๏‚ง A blocking send routine will only โ€œreturnโ€ after it is safe to modify the application buffer (your send data) for reuse. Safe means that modifications will not affect the data intended for the rec eive task. Safe dose not imply that the data was actually received โ€“ it may very well be sitting i n a system buffer. A blocking send can be synchronous which means there is handshaking occurring with the re ceive task to confirm a safe send. A blocking send can be asynchronous if a system buffer is used to hold the data for eventual d elivery to the receive. A blocking receive only โ€œreturnsโ€ after the data has arrived and is ready for use by the progra m. Non-blocking ๏‚ง ๏‚ง ๏‚ง ๏‚ง 120 Non-blocking send and receive routines behave similarly โ€“ they will return almost immediately. They do not wait for any communication events to complete, such as message copying from u ser memory to system buffer space or the actual arrival of message. Non-blocking operations simply โ€œrequestโ€ the MPI library to perform the operation when it is a ble. The user can not predict when it is able. The user can not predict when that will happen. It is unsafe to modify the application buffer (your variable space) until you know for a fact the r equested non-blocking operation was actually performed by the library. There are โ€œwaitโ€ routin es used to do this. Non-blocking communications are primarily used to overlap computation with communication and exploit possibale performance gains.
  • 121. MPI Message Passing Routine Arguments MPI point-to-point communication routines generally have an argument list that takes one of t he following formats : Blocking sends MPI_Send(buffer, count, type, dest, tag, comm) Non-blocking sends MPI_Isend(buffer, count, type, dest, tag, comm, request) Blocking receive MPI_Recv(buffer, count, type, source, tag, comm, status) Non-blocking receive MPI_Irecv(buffer, count, type, source, tag, comm, request) Buffer ๏‚ง Program (application) address space that references the data that is to be sent or receiv ed. In most cases, this is simply the variable name that is be sent/received. For C progra ms, this argument is passed by reference and usually must be prepended with an amper sand : &var1 Data count ๏‚ง 121 Indicates the number of data elements of a particular type to be sent.
  • 122. MPI Message Passing Routine Arguments Data type ๏‚ง For reasons of portability, MPI predefines its elementary data types. The table below lists those required by the standard. C Data Types MPI_CHAR MPI_SHORT signed short int MPI_INT signed int MPI_LONG signed long int MPI_SIGNED_CHAR signed char MPI_UNSIGNED_CHAR unsigned char MPI_UNSIGNED_SHORT unsigned short int MPI_UNSIGNED unsigned int MPI_UNSIGNED_LONG unsigned long int MPI_FLOAT float MPI_DOUBLE double MPI_LONG_DOUBLE 122 signed char long double
  • 123. MPI Message Passing Routine Arguments Destination ๏‚ง An argument to send routines that indicates the process where a message should be del ivered. Specified as the rank of the receiving process. Tag ๏‚ง Arbitrary non-negative integer assigned by the programmer to uniquely identify a messa ge. Send and receive operations should match message tags. For a receive operation, th e wild card MPI_ANY_TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers 0 โ€“ 32767 can be used as tags, but most impleme ntations allow a much larger range than this. Communicator ๏‚ง 123 Indicates the communication context, or set of processes for which the source or destin ation fields are valid. Unless the programmer is explicitly creating new communicator, th e predefined communicator MPI_COMM_WORLD is usually used.
  • 124. MPI Message Passing Routine Arguments Status ๏‚ง ๏‚ง ๏‚ง ๏‚ง For a receive operation, indicates the source of the message and the tag of the message. In C, this argument is a pointer to predefined structure MPI_Status (ex. stat.MPI_SOURC E, stat.MPI_TAG). In Fortran, it is an integer array of size MPI_STATUS_SIZE (ex. stat(MPI_SOURCE), stat(M PI_TAG)). Additionally, the actual number of bytes received are obtainable from Status via MPI_Get _out routine. Request ๏‚ง ๏‚ง ๏‚ง ๏‚ง ๏‚ง 124 Used by non-blocking send and receive operations. Since non-blocking operations may return before the requested system buffer space is o btained, the system issues a unique โ€œrequest numberโ€. The programmer uses this system assigned โ€œhandleโ€ later (in a WAIT type routine) to det ermine completion of the non-blocking operation. In C, this argument is pointer to predefined structure MPI_Request. In Fortran, it is an integer.
• 125. Example : Blocking Message Passing Routine (1/2)
#include "mpi.h"
#include <stdio.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, dest, source, rc, count, tag=1;
    char inmsg, outmsg='x';
    MPI_Status Stat;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1; source = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = 0; source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }
• 126. Example : Blocking Message Passing Routine (2/2)
    rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
    printf("Task %d: Received %d char(s) from task %d with tag %d\n",
           rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);

    MPI_Finalize();
    return 0;
}
• 127. Example : Dead Lock
#include "mpi.h"
#include <stdio.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, dest, source, rc, count, tag=1;
    char inmsg, outmsg='x';
    MPI_Status Stat;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* both ranks send first: each MPI_Send may block waiting for a matching
       receive that is never posted, so the program can deadlock */
    if (rank == 0) {
        dest = 1; source = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = 0; source = 0;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
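One possible fix for the deadlock above (an editor's sketch, not from the original slides) is to replace each rank's back-to-back send/receive with a single MPI_Sendrecv, which the library can always order safely. The variables are those declared in the example above.

    if (rank == 0) {
        dest = 1; source = 1;
        rc = MPI_Sendrecv(&outmsg, 1, MPI_CHAR, dest, tag,
                          &inmsg,  1, MPI_CHAR, source, tag,
                          MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = 0; source = 0;
        rc = MPI_Sendrecv(&outmsg, 1, MPI_CHAR, dest, tag,
                          &inmsg,  1, MPI_CHAR, source, tag,
                          MPI_COMM_WORLD, &Stat);
    }
    /* swapping the send/receive order on one of the two ranks, as in the
       example on slide 125, also removes the deadlock */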
• 128. Example : Non-Blocking Message Passing Routine (1/2)
Nearest neighbor exchange in a ring topology

#include "mpi.h"
#include <stdio.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2;
    MPI_Request reqs[4];
    MPI_Status stats[4];   /* one status per request passed to MPI_Waitall */

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    prev = rank-1;
    next = rank+1;
    if (rank == 0) prev = numtasks - 1;
    if (rank == (numtasks - 1)) next = 0;
• 129. Example : Non-Blocking Message Passing Routine (2/2)
    MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);

    /* do some work */

    MPI_Waitall(4, reqs, stats);

    MPI_Finalize();
    return 0;
}
• 130. Advanced Example : Monte-Carlo Simulation
<Problem>
๏‚ง Monte-Carlo simulation using random numbers
๏‚ง PI = 4 x Ac/As, where Ac is the area of a quarter circle of radius r and As is the area of the enclosing square of side r
<Requirement>
๏‚ง Use N processes (ranks)
๏‚ง Use point-to-point communication
• 131. Advanced Example : Monte-Carlo Simulation for PI
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0; cnt = 0; r = 0.0;

    for (i=0; i<num_step; i++) {
        x = rand() / (RAND_MAX+1.0);
        y = rand() / (RAND_MAX+1.0);
        r = sqrt(x*x + y*y);
        if (r<=1) cnt += 1;
    }

    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
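A possible point-to-point parallelization of the serial code above (an editor's sketch meeting the stated requirement, not the official solution): the strided loop decomposition, the simple per-rank srand seeding, and the use of MPI_LONG for the hit counts are all assumptions of this sketch.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i, cnt = 0, total, tmp;
    int rank, nprocs, p;
    double x, y, pi;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(rank + 1);                       /* crude per-rank stream, for illustration only */
    for (i = rank; i < num_step; i += nprocs) {
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        if (x * x + y * y <= 1.0) cnt++;   /* hit inside the quarter circle */
    }

    if (rank != 0) {
        MPI_Send(&cnt, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    } else {
        total = cnt;
        for (p = 1; p < nprocs; p++) {     /* rank 0 collects every partial count */
            MPI_Recv(&tmp, 1, MPI_LONG, p, 0, MPI_COMM_WORLD, &stat);
            total += tmp;
        }
        pi = 4.0 * (double)total / (double)num_step;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}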
• 132. Advanced Example : Numerical integration for PI
<Problem>
๏‚ง Get PI using numerical integration (midpoint rule) :
    \int_0^1 \frac{4.0}{1+x^2}\,dx = \pi \approx \sum_{i=1}^{n} \frac{4}{1+\left(\frac{i-0.5}{n}\right)^2} \cdot \frac{1}{n}
  [Figure: the integrand evaluated at the midpoints x_1 = (1-0.5)/n, x_2 = (2-0.5)/n, ..., x_n = (n-0.5)/n, each rectangle of width 1/n]
<Requirement>
๏‚ง Point-to-point communication
• 133. Advanced Example : Numerical integration for PI
#include <stdio.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0/(double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");

    for (i=0; i<num_step; i++) {
        x = ((double)i + 0.5) * step;   /* midpoint of the i-th subinterval,
                                           matching (i-0.5)/n for i = 1..n */
        sum += 4.0/(1.0+x*x);
    }

    pi = step * sum;
    printf("PI = %5lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
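A possible point-to-point parallelization of the integration code (an editor's sketch, not the official solution): each rank sums a strided subset of the midpoints and sends its partial sum to rank 0; the strided decomposition is an assumption of this sketch.

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i;
    int rank, nprocs, p;
    double x, sum = 0.0, tmp, pi, step;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    step = 1.0 / (double)num_step;
    for (i = rank; i < num_step; i += nprocs) {
        x = ((double)i + 0.5) * step;      /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }

    if (rank != 0) {
        MPI_Send(&sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        for (p = 1; p < nprocs; p++) {     /* accumulate the other partial sums */
            MPI_Recv(&tmp, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, &stat);
            sum += tmp;
        }
        pi = step * sum;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}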
• 134. Type of Collective Operations
Synchronization
๏‚ง Processes wait until all members of the group have reached the synchronization point.
Data Movement
๏‚ง broadcast, scatter/gather, all to all.
Collective Computation (reductions)
๏‚ง One member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
• 135. Programming Considerations and Restrictions
With MPI-3, collective operations can be blocking or non-blocking. Only blocking operations are covered in this tutorial.
Collective communication routines do not take message tag arguments.
Collective operations within subsets of processes are accomplished by first partitioning the subsets into new groups and then attaching the new groups to new communicators.
Can only be used with MPI predefined datatypes – not with MPI Derived Data Types.
MPI-2 extended most collective operations to allow data movement between intercommunicators (not covered here).
• 136. Collective Communication Routines
MPI_Barrier
๏‚ง Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.
C       : MPI_Barrier(comm)
Fortran : MPI_BARRIER(comm, ierr)
• 137. Collective Communication Routines
MPI_Bcast
๏‚ง Data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
C       : MPI_Bcast(&buffer, count, datatype, root, comm)
Fortran : MPI_BCAST(buffer, count, datatype, root, comm, ierr)
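A minimal usage sketch (added for illustration; the variable n and the value 100 are placeholders): only the root knows the value before the call, and every rank holds it afterwards.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) n = 100;               /* only the root sets the value */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d now has n = %d\n", rank, n);   /* 100 on every rank */
    MPI_Finalize();
    return 0;
}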
• 138. Collective Communication Routines
MPI_Scatter
๏‚ง Data movement operation. Distributes distinct messages from a single source task to each task in the group.
C       : MPI_Scatter(&sendbuf, sendcnt, sendtype, &recvbuf, recvcnt, recvtype, root, comm)
Fortran : MPI_SCATTER(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm, ierr)
• 139. Collective Communication Routines
MPI_Gather
๏‚ง Data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.
C       : MPI_Gather(&sendbuf, sendcnt, sendtype, &recvbuf, recvcount, recvtype, root, comm)
Fortran : MPI_GATHER(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm, ierr)
• 140. Collective Communication Routines
MPI_Allgather
๏‚ง Data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
C       : MPI_Allgather(&sendbuf, sendcount, sendtype, &recvbuf, recvcount, recvtype, comm)
Fortran : MPI_ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
• 141. Collective Communication Routines
MPI_Reduce
๏‚ง Collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.
C       : MPI_Reduce(&sendbuf, &recvbuf, count, datatype, op, root, comm)
Fortran : MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
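A minimal usage sketch (added for illustration; the per-rank value rank+1 is a placeholder): each rank contributes one number and the sum appears on the root only.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    long local, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local = rank + 1;                      /* each rank's contribution */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)                         /* the result is defined only on the root */
        printf("sum over %d ranks = %ld\n", nprocs, total);
    MPI_Finalize();
    return 0;
}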
• 142. Collective Communication Routines
The predefined MPI reduction operations appear below. Users can also define their own reduction functions by using the MPI_Op_create routine.
MPI Reduction Operation   Operation                 C Data Types
MPI_MAX                   maximum                   integer, float
MPI_MIN                   minimum                   integer, float
MPI_SUM                   sum                       integer, float
MPI_PROD                  product                   integer, float
MPI_LAND                  logical AND               integer
MPI_BAND                  bit-wise AND              integer, MPI_BYTE
MPI_LOR                   logical OR                integer
MPI_BOR                   bit-wise OR               integer, MPI_BYTE
MPI_LXOR                  logical XOR               integer
MPI_BXOR                  bit-wise XOR              integer, MPI_BYTE
MPI_MAXLOC                max value and location    float, double and long double
MPI_MINLOC                min value and location    float, double and long double
• 143. Collective Communication Routines
MPI_Allreduce
๏‚ง Collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
C       : MPI_Allreduce(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran : MPI_ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm, ierr)
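A small sketch (added for illustration; the per-rank value rank+1 is a placeholder) showing the equivalence stated above: both paths leave the same total on every rank.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    double local, via_allreduce, via_reduce_bcast = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    local = (double)(rank + 1);

    /* one call ... */
    MPI_Allreduce(&local, &via_allreduce, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* ... versus the equivalent two-step version */
    MPI_Reduce(&local, &via_reduce_bcast, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&via_reduce_bcast, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d: %f %f\n", rank, via_allreduce, via_reduce_bcast);
    MPI_Finalize();
    return 0;
}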
• 144. Collective Communication Routines
MPI_Reduce_scatter
๏‚ง Collective computation operation + data movement. First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.
C       : MPI_Reduce_scatter(&sendbuf, &recvbuf, recvcount, datatype, op, comm)
Fortran : MPI_REDUCE_SCATTER(sendbuf, recvbuf, recvcount, datatype, op, comm, ierr)
• 145. Collective Communication Routines
MPI_Alltoall
๏‚ง Data movement operation. Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.
C       : MPI_Alltoall(&sendbuf, sendcount, sendtype, &recvbuf, recvcnt, recvtype, comm)
Fortran : MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm, ierr)
• 146. Collective Communication Routines
MPI_Scan
๏‚ง Performs a scan operation with respect to a reduction operation across a task group.
C       : MPI_Scan(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran : MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm, ierr)
• 147. Collective Communication Routines
[Figure: data layouts before and after each collective across processes P0–P3, where * denotes the reduction operator —
 broadcast : the root's A is copied to every process;
 scatter / gather : A, B, C, D are distributed one per process / collected back on the root;
 allgather : every process ends up with A, B, C, D;
 alltoall : process Pi sends its j-th element to Pj (the data matrix is transposed);
 reduce : A*B*C*D on the root only; allreduce : A*B*C*D on every process;
 scan : Pi receives the prefix A*...*(its own element);
 reduce_scatter : element-wise reduction whose result is then scattered across the processes]
• 148. Example : Collective Communication (1/2)
Perform a scatter operation on the rows of an array

#include "mpi.h"
#include <stdio.h>
#define SIZE 4

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, sendcount, recvcount, source;
    float sendbuf[SIZE][SIZE] = {
        { 1.0,  2.0,  3.0,  4.0},
        { 5.0,  6.0,  7.0,  8.0},
        { 9.0, 10.0, 11.0, 12.0},
        {13.0, 14.0, 15.0, 16.0} };
    float recvbuf[SIZE];

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
• 149. Example : Collective Communication (2/2)
    if (numtasks == SIZE) {
        source = 1;
        sendcount = SIZE;
        recvcount = SIZE;
        MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf, recvcount,
                    MPI_FLOAT, source, MPI_COMM_WORLD);
        printf("rank= %d Results: %f %f %f %f\n", rank, recvbuf[0],
               recvbuf[1], recvbuf[2], recvbuf[3]);
    }
    else
        printf("Must specify %d processors. Terminating.\n", SIZE);

    MPI_Finalize();
    return 0;
}
• 150. Advanced Example : Monte-Carlo Simulation for PI
Use the collective communication routines!

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0; cnt = 0; r = 0.0;

    for (i=0; i<num_step; i++) {
        x = rand() / (RAND_MAX+1.0);
        y = rand() / (RAND_MAX+1.0);
        r = sqrt(x*x + y*y);
        if (r<=1) cnt += 1;
    }

    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
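One possible collective solution (an editor's sketch, not the official answer): the per-rank counting is the same as in the point-to-point sketch on slide 131, but the whole send/receive exchange collapses into a single MPI_Reduce. The strided decomposition and per-rank seeding remain assumptions of the sketch.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i, cnt = 0, total = 0;
    int rank, nprocs;
    double x, y, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(rank + 1);                                  /* crude per-rank stream */
    for (i = rank; i < num_step; i += nprocs) {
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        if (x * x + y * y <= 1.0) cnt++;
    }

    /* one reduction replaces the explicit point-to-point collection */
    MPI_Reduce(&cnt, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = 4.0 * (double)total / (double)num_step;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}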
• 151. Advanced Example : Numerical integration for PI
Use the collective communication routines!

#include <stdio.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0/(double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");

    for (i=0; i<num_step; i++) {
        x = ((double)i + 0.5) * step;   /* midpoint of the i-th subinterval */
        sum += 4.0/(1.0+x*x);
    }

    pi = step * sum;
    printf("PI = %5lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
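One possible collective solution for this exercise as well (an editor's sketch, not the official answer): each rank sums its strided share of the midpoints and a single MPI_Reduce combines the partial sums on rank 0.

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i;
    int rank, nprocs;
    double x, sum = 0.0, total = 0.0, pi, step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    step = 1.0 / (double)num_step;
    for (i = rank; i < num_step; i += nprocs) {
        x = ((double)i + 0.5) * step;                 /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }

    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = step * total;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}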