Introduction to Parallel Programming
Parallel programming using OpenMP and MPI

Transcript

  • 1. Supercomputing Education - UNIST Parallel Programming
  • 2. CONTENTS I. Introduction to Parallel Computing II. Parallel Programming using OpenMP III. Parallel Programming using MPI
  • 3. I. Introduction to Parallel Computing
  • 4. Parallel Processing (1/3) Parallel processing means dividing a computation that would otherwise run sequentially into several parts and executing each part simultaneously on multiple processors.
  • 5. Parallel Processing (2/3) [Figure: serial execution versus parallel execution, from Inputs to Outputs]
  • 6. Parallel Processing (3/3) Main goal: solve larger problems faster, i.e. reduce a program's wall-clock time and increase the size of the problems that can be solved. Computing resources for parallel computing: a single computer with multiple processors (CPUs), or multiple computers connected by a network.
  • 7. Why Parallel? Limits on building ever-faster single-processor systems: the signal transmission speed limit (copper: about 9 cm/nanosecond), the limits of miniaturization, and economic constraints. The emergence of faster networks, distributed systems, and multiprocessor architectures leads to parallel computing environments: by combining several relatively inexpensive processors and using them simultaneously, the desired performance gain can be obtained.
  • 8. Programs and Processes A process is an executable program, stored as a file on secondary storage, that has been loaded and placed under the execution control of the operating system (kernel). Program: stored on secondary storage. Process: a program being executed by the computer system. Task = process.
  • 9. Processes A process is the unit of resource allocation for program execution, and one program may run as several processes. On a single-processor system that supports multiple processes, resources are wasted and context-switching overhead arises. Context switching: at any instant only one process runs on a given processor, so the state of the current process is saved and the state of another process is loaded. The process is the unit of work assignment in the distributed-memory parallel programming model.
  • 10. Threads A thread separates out only the execution aspect of a process: process = execution units (threads) + execution environment (shared resources). Several threads can exist in one process and share the execution environment with the other threads of that process. On a single-processor system that supports multithreading, resource allocation and context switching are more efficient than with multiple processes. The thread is the unit of work assignment in the shared-memory parallel programming model.
  • 11. Processes and Threads [Figure: three processes with one thread each versus one process with three threads]
  • 12. Types of Parallelism Data parallelism (domain decomposition): each task performs the same series of computations on different data. Task parallelism (functional decomposition): each task performs different computations on the same or different data.
  • 13. Data Parallelism (1/3) Data parallelism: domain decomposition. [Figure: the problem data set is divided among Task 1, Task 2, Task 3, and Task 4]
  • 14. Data Parallelism (2/3) Code example: matrix multiplication (OpenMP).
Serial Code:
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
Parallel Code:
!$OMP PARALLEL DO
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + (A(I,K)*B(K,J))
    END DO
  END DO
END DO
!$OMP END PARALLEL DO
  • 15. Data Parallelism (3/3) Data decomposition (4 processors, K=1,20):
Process   Iterations of K   Data Elements
Proc0     K = 1:5           A(I,1:5),   B(1:5,J)
Proc1     K = 6:10          A(I,6:10),  B(6:10,J)
Proc2     K = 11:15         A(I,11:15), B(11:15,J)
Proc3     K = 16:20         A(I,16:20), B(16:20,J)
  • 16. Task Parallelism (1/3) Task parallelism: functional decomposition. [Figure: the problem instruction set is divided among Task 1, Task 2, Task 3, and Task 4]
  • 17. Task Parallelism (2/3) Code example (OpenMP).
Serial Code:
PROGRAM MAIN
  ...
  CALL interpolate()
  CALL compute_stats()
  CALL gen_random_params()
  ...
END
Parallel Code:
PROGRAM MAIN
  ...
!$OMP PARALLEL
!$OMP SECTIONS
  CALL interpolate()
!$OMP SECTION
  CALL compute_stats()
!$OMP SECTION
  CALL gen_random_params()
!$OMP END SECTIONS
!$OMP END PARALLEL
  ...
END
  • 18. Task Parallelism (3/3) Task decomposition (executed simultaneously on 3 processors):
Process   Code
Proc0     CALL interpolate()
Proc1     CALL compute_stats()
Proc2     CALL gen_random_params()
  • 19. Parallel Architectures (1/2) Processor organizations:
Single Instruction, Single Data Stream (SISD): uniprocessor.
Single Instruction, Multiple Data Stream (SIMD): vector processor, array processor.
Multiple Instruction, Single Data Stream (MISD).
Multiple Instruction, Multiple Data Stream (MIMD): shared memory (tightly coupled) - symmetric multiprocessor (SMP), non-uniform memory access (NUMA); distributed memory (loosely coupled) - clusters.
  • 20. Parallel Architectures (2/2) Recent high-performance systems support distributed-shared memory. Software DSM (Distributed Shared Memory): message passing supported on shared-memory systems, variable sharing supported on distributed-memory systems. Hardware DSM (distributed-shared memory architecture): each node of a distributed-memory system is built as a shared-memory system. NUMA: appears to users as a single shared-memory architecture, e.g. Superdome (HP), Origin 3000 (SGI). SMP cluster: appears as a distributed system composed of SMPs, e.g. SP (IBM), Beowulf clusters.
  • 21. Parallel Programming Models Shared-memory parallel programming model: suited to shared-memory architectures; multithreaded programs; OpenMP, Pthreads. Message-passing parallel programming model: suited to distributed-memory architectures; MPI, PVM. Hybrid parallel programming model: for distributed-shared memory architectures; OpenMP + MPI.
  • 22. Shared-Memory Parallel Programming Model [Figure: a single-threaded run (S1, P1-P4, S2) compared with a multithreaded run in which the master thread forks threads to execute P1-P4 over a shared address space and joins them before S2]
  • 23. Message-Passing Parallel Programming Model [Figure: a serial run (S1, P1-P4, S2) compared with four processes (Process 0-3 on Node 1-4), each executing S1, its own Pi, and S2, with data transmitted over the interconnect]
  • 24. Hybrid Parallel Programming Model [Figure: message passing between Process 0 on Node 1 and Process 1 on Node 2, while each process forks threads over a shared address space to execute P1-P2 and P3-P4, then joins before S2]
  • 25. Message Passing on a DSM System [Figure: four processes (Process 0-3) placed two per node on Node 1 and Node 2, each executing S1, Pi, and S2, communicating by message passing]
  • 26. SPMD and MPMD (1/4) SPMD (Single Program Multiple Data): one program runs simultaneously in several processes; at any instant the processes execute instructions from the same program, and those instructions may or may not be the same. MPMD (Multiple Program Multiple Data): an MPMD application consists of several executables; when the application runs in parallel, each process may execute the same program as, or a different program from, the other processes.
  • 27. SPMD and MPMD (2/4) SPMD: [Figure: the same executable a.out runs on Node 1, Node 2, and Node 3]
  • 28. SPMD and MPMD (3/4) MPMD, Master/Worker (self-scheduling): [Figure: a.out runs on Node 1 as the master, while b.out runs on Node 2 and Node 3 as workers]
  • 29. SPMD and MPMD (4/4) MPMD, coupled analysis: [Figure: a.out, b.out, and c.out run on Node 1, Node 2, and Node 3 respectively]
  • 30. Performance measurement; factors affecting performance; steps for writing a parallel program.
  • 31. Measuring Program Execution Time (1/2) Using time (bash, ksh): $ time [executable]
$ time mpirun -np 4 -machinefile machines ./exmpi.x
real 0m3.59s
user 0m3.16s
sys 0m0.04s
real = wall-clock time; user = CPU time used by the program itself and the libraries it calls; sys = CPU time used by system calls made on behalf of the program; user + sys = CPU time.
  • 32. Measuring Program Execution Time (2/2) Using time (csh): $ time [executable]
$ time testprog
1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w
(1) user CPU time (1.15 s); (2) system CPU time (0.02 s); (3) real time (0 min 1.76 s); (4) fraction of real time spent as CPU time (66.4%); (5) memory use: shared (15 KB) + unshared (3981 KB); (6) input (24 blocks) + output (10 blocks); (7) no page faults; (8) no swaps.
  • 33. Performance Measurement Quantitative analysis of the performance gained through parallelization. Performance metrics: speed-up, efficiency, and cost.
  • 34. Speed-up (1/7) Speed-up S(n) = ts / tp, where ts is the execution time of the serial program and tp is the execution time of the parallel program on n processors. It measures the performance gain of the parallel program over the serial program; execution time means wall-clock time. Example: if a serial program that takes 100 seconds is parallelized and runs in 50 seconds on 10 processors, then S(10) = 100/50 = 2.
  • 35. Speed-up (2/7) Ideal speed-up: Amdahl's Law. Let f be the serial fraction of the code (0 <= f <= 1). Then tp = f*ts + (1-f)*ts/n, the sum of the serial-part execution time and the parallelized-part execution time.
  • 36. Speed-up (3/7) [Figure: the serial run time ts splits into a serial section f*ts and parallelizable sections (1-f)*ts; with n processes the parallelizable work shrinks to (1-f)*ts/n, giving tp = f*ts + (1-f)*ts/n]
  • 37. Speed-up (4/7) S(n) = ts/tp = ts/(f*ts + (1-f)*ts/n) = 1/(f + (1-f)/n). Maximum speed-up (n -> infinity): S(n) -> 1/f. As the number of processors increases, the speed-up converges to the reciprocal of the serial fraction.
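To make the formula concrete, here is a minimal C sketch (not from the slides) that evaluates Amdahl's speed-up S(n) = 1/(f + (1-f)/n); the chosen values of f and n are illustrative only.

#include <stdio.h>

/* Amdahl's law: ideal speed-up for serial fraction f on n processors. */
static double amdahl_speedup(double f, int n)
{
    return 1.0 / (f + (1.0 - f) / n);
}

int main(void)
{
    double fractions[] = {0.0, 0.05, 0.1, 0.2};   /* serial fractions used in the plots */
    int    procs[]     = {4, 16, 256};            /* illustrative processor counts */

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            printf("f=%.2f  n=%3d  S(n)=%7.2f\n",
                   fractions[i], procs[j], amdahl_speedup(fractions[i], procs[j]));
    return 0;
}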
  • 38. Speed-up (5/7) Example with f = 0.2, n = 4: [Figure: of 100 units of serial work, 20 cannot be parallelized and 80 can, so each of the 4 processes runs 20 units of parallel work after the 20-unit serial part] S(4) = 1/(0.2 + (1-0.2)/4) = 2.5.
  • 39. Speed-up (6/7) Speed-up versus number of processors. [Figure: speed-up curves for f = 0, 0.05, 0.1, and 0.2 plotted against the number of processors n, up to n = 24; only f = 0 scales linearly]
  • 40. Speed-up (7/7) Speed-up versus serial fraction. [Figure: speed-up plotted against the serial fraction f for n = 16 and n = 256; speed-up drops sharply as f increases]
  • 41. Efficiency Efficiency E(n) = ts/(tp*n) = S(n)/n. It indicates how effectively the parallel program uses the n processors. Examples: a 2x speed-up on 10 processors gives S(10) = 2 and E(10) = 20%; a 10x speed-up on 100 processors gives S(100) = 10 and E(100) = 10%.
  • 42. Cost Cost = execution time x number of processors. Serial program: Cost = ts. Parallel program: Cost = tp*n = ts*n/S(n) = ts/E(n). Example: a 2x speed-up on 10 processors and a 10x speed-up on 100 processors:
ts    tp   n    S(n)  E(n)  Cost
100   50   10   2     0.2   500
100   10   100  10    0.1   1000
  • 43. Considerations for Real Speed-up Actual speed-up is reduced by communication overhead and load-imbalance problems. [Figure: the f = 0.2 example again, but each of the four processes also incurs communication overhead and the parallel work is unevenly distributed]
  • 44. Ways to Increase Performance 1. Increase the parallelizable portion of the program (coverage) by improving the algorithm. 2. Distribute the workload evenly: load balancing. 3. Reduce the time spent on communication (communication overhead).
  • 45. Factors Affecting Performance Coverage (Amdahl's Law), load balancing, synchronization, communication overhead, granularity, and I/O.
  • 46. Load Balancing Distribute the work so that all processes have working times as nearly equal as possible, minimizing the time spent waiting. Choose the data distribution scheme (block, cyclic, block-cyclic) carefully; this is especially important when heterogeneous systems are connected; balance can also be achieved through dynamic work assignment. [Figure: task0-task3 with unequal WORK time and the resulting WAIT time] A sketch of the block and cyclic index distributions is shown below.
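As a hypothetical illustration of the distribution schemes named above (not taken from the slides; N and nprocs are made-up values), the following C fragment shows how a loop of N iterations could be assigned to nprocs ranks in block and in cyclic fashion.

#include <stdio.h>

#define N 16

int main(void)
{
    int nprocs = 4;   /* illustrative number of processes */

    /* Block distribution: rank r gets one contiguous chunk of about N/nprocs iterations. */
    for (int rank = 0; rank < nprocs; rank++) {
        int chunk = (N + nprocs - 1) / nprocs;        /* ceiling division */
        int lo = rank * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;
        printf("block : rank %d handles i = %d .. %d\n", rank, lo, hi - 1);
    }

    /* Cyclic distribution: rank r gets iterations r, r+nprocs, r+2*nprocs, ... */
    for (int rank = 0; rank < nprocs; rank++) {
        printf("cyclic: rank %d handles i =", rank);
        for (int i = rank; i < N; i += nprocs)
            printf(" %d", i);
        printf("\n");
    }
    return 0;
}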
  • 47. Synchronization Coordination that brings parallel tasks to a consistent state or shares their information; it is a major form of parallel overhead and hurts performance. Implemented with barriers, locks, semaphores, synchronous communication operations, and so on. Parallel overhead: the cost of starting, terminating, and coordinating parallel tasks. Start-up: task identification, processor assignment, task loading, data loading, etc. Termination: collecting and sending results, releasing operating-system resources, etc. Coordination: synchronization, communication, etc.
  • 48. Communication Overhead (1/4) Overhead caused by data communication; the network has its own latency and bandwidth. It is the key concern in message passing. Factors that affect communication overhead: synchronous vs. asynchronous communication, blocking vs. non-blocking, point-to-point vs. collective communication, and the number and size of the data transfers.
  • 49. Communication Overhead (2/4) Communication time = latency + message size / bandwidth. Latency: the time for the first bit of the message to arrive (send latency + receive latency + transmission latency). Bandwidth: the amount of data that can be communicated per unit time (MB/sec). Effective bandwidth = message size / communication time = bandwidth / (1 + latency x bandwidth / message size). A small numeric sketch of this model follows.
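This is a minimal C sketch (not from the slides) that plugs the latency and bandwidth quoted two slides later (22 us, 133 MB/sec) into the model above and prints the communication time and effective bandwidth for a few illustrative message sizes.

#include <stdio.h>

int main(void)
{
    double latency   = 22e-6;            /* seconds, value from slide 51 */
    double bandwidth = 133e6;            /* bytes/sec (133 MB/sec), value from slide 51 */
    double sizes[]   = {1e2, 1e4, 1e6};  /* illustrative message sizes in bytes */

    for (int i = 0; i < 3; i++) {
        double t_comm = latency + sizes[i] / bandwidth;   /* communication time */
        double eff_bw = sizes[i] / t_comm;                /* effective bandwidth */
        printf("msg %8.0f B : time %.6f s, effective bandwidth %6.1f MB/s\n",
               sizes[i], t_comm, eff_bw / 1e6);
    }
    return 0;
}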
  • 50. Communication Overhead (3/4) [Figure: communication time versus message size; the intercept is the latency and 1/slope is the bandwidth]
  • 51. Communication Overhead (4/4) [Figure: effective bandwidth versus message size for latency = 22 us and bandwidth = 133 MB/sec; the effective bandwidth approaches the network bandwidth only for large messages]
  • 52. Granularity (1/2) The ratio of computation time to communication time within a parallel program. Fine-grained parallelism: relatively little computation between communication or synchronization events; favorable for load balancing. Coarse-grained parallelism: relatively much computation between communication or synchronization events; less favorable for load balancing. Coarse-grained parallelism is generally better for performance, because otherwise computation time is dominated by communication or synchronization time, though this can vary with the algorithm and the hardware environment.
  • 53. Granularity (2/2) [Figure: timelines comparing (a) fine-grained and (b) coarse-grained execution, alternating communication and computation phases]
  • 54. I/O Input/output generally hinders parallelism. Writes: overwriting problems when the same file space is shared. Reads: the file server can become a bottleneck when handling multiple read requests. I/O that goes over the network (NFS, non-local) is a bottleneck. Reduce I/O as much as possible: confine I/O to specific serial regions and perform I/O in local file space. Parallel file systems are being developed (GPFS, PVFS, PPFS, ...), as are parallel I/O programming interfaces (MPI-2: MPI I/O).
  • 55. Scalability (1/2) The ability to gain performance from an expanded environment: hardware scalability and algorithmic scalability. Main hardware factors that affect scalability: CPU-memory bus bandwidth, network bandwidth, memory capacity, and processor clock speed.
  • 56. Scalability (2/2) [Figure: speed-up versus number of workers]
  • 57. Dependency and Deadlock Data dependency: the execution order of a program affects its result, e.g.
DO k = 1, 100
  F(k+2) = F(k+1) + F(k)
ENDDO
Deadlock: two or more processes each wait for an event that the other must produce, e.g.
Process 1:               Process 2:
X = 4                    Y = 8
SOURCE = TASK2           SOURCE = TASK1
RECEIVE (SOURCE,Y)       RECEIVE (SOURCE,X)
DEST = TASK2             DEST = TASK1
SEND (DEST,X)            SEND (DEST,Y)
Z = X + Y                Z = X + Y
  • 58. Dependency [Figure: computing F(k+2) = F(k+1) + F(k) for k = 1..100 over F(1)..F(n); serial execution produces 1, 2, 3, 5, 8, 13, 21, ..., while a naive parallel execution reads stale values and produces 1, 2, 3, 5(4), 7, 11, 18, ...]
  • 59. Steps for Writing a Parallel Program (1) Write, analyze (profile), and optimize the serial code; identify hotspots, bottlenecks, and data dependencies; decide between data parallelism and task parallelism. (2) Develop the parallel code: MPI, OpenMP, or something else? Add code for task assignment and control, communication, and synchronization. (3) Compile, run, and debug. (4) Optimize the parallel code: improve performance through measurement and analysis.
  • 60. Debugging and Performance Analysis Debugging: take a modular approach when writing code; watch out for communication, synchronization, data dependencies, and deadlock; debugger: TotalView. Performance measurement and analysis: use timer functions; profilers: prof, gprof, pgprof, TAU.
  • 61. Coffee break
  • 62. II. Parallel Programming using OpenMP
  • 63. What is OpenMP? An application programming interface (API) for writing multithreaded parallel programs in a shared-memory environment.
  • 64. History of OpenMP 1990s: high-performance shared-memory systems advance, vendors use their own directive sets, and the need for a standard is recognized. ANSI X3H5 in 1994; openmp.org founded in 1996; the OpenMP API announced in 1997. Release history: OpenMP Fortran API 1.0, October 1997; C/C++ API 1.0, October 1998; Fortran API 1.1, November 1999; Fortran API 2.0, November 2000; C/C++ API 2.0, March 2002; combined C/C++ and Fortran API 2.5, May 2005; API 3.0, May 2008.
  • 65. Goals of OpenMP Standardization and portability: the standard for shared-memory parallel programming; OpenMP compilers exist for most Unix systems and for Windows; Fortran and C/C++ are supported.
  • 66. Components of OpenMP (1/2) Directives, runtime library, and environment variables.
  • 67. Components of OpenMP (2/2) Compiler directives: handle work sharing, communication, and synchronization among threads; OpenMP in the narrow sense, e.g. C$OMP PARALLEL DO. Runtime library: sets and queries parallel parameters (number of participating threads, thread ID, etc.), e.g. CALL omp_set_num_threads(128). Environment variables: define the parallel parameters of the executing system (number of threads, etc.), e.g. export OMP_NUM_THREADS=8.
  • 68. OpenMP Programming Model (1/4) Based on compiler directives: insert directives at appropriate points in the serial code, and the compiler uses them to generate multithreaded code. A compiler that supports OpenMP is required, and work such as synchronization and removing dependencies is still needed.
  • 69. OpenMP Programming Model (2/4) Fork-join: multiple threads are created where parallelism is needed, and execution returns to a single sequential thread when the parallel computation ends. [Figure: the master thread forks a team of threads for each parallel region and joins them at its end]
  • 70. OpenMP Programming Model (3/4) Inserting compiler directives.
Serial Code:
PROGRAM exam
  ...
  ialpha = 2
  DO i = 1, 100
    a(i) = a(i) + ialpha*b(i)
  ENDDO
  PRINT *, a
END
Parallel Code:
PROGRAM exam
  ...
  ialpha = 2
!$OMP PARALLEL DO
  DO i = 1, 100
    a(i) = a(i) + ialpha*b(i)
  ENDDO
!$OMP END PARALLEL DO
  PRINT *, a
END
  • 71. OpenMP Programming Model (4/4) Fork-join with export OMP_NUM_THREADS=4: [Figure: the master thread executes ialpha = 2, forks three slave threads at the parallel loop so that the four threads run i=1,25 / i=26,50 / i=51,75 / i=76,100, then joins, and the master alone executes PRINT *, a]
  • 72. Advantages and Disadvantages of OpenMP Advantages: easier to code and debug than MPI; data distribution is simple; incremental parallelization is possible; one source can be compiled as either parallel or serial code; relatively small code size. Disadvantages: can only be implemented on shared-memory multiprocessor architectures; requires a compiler that supports OpenMP; relies heavily on loops, so parallel efficiency can be low; limited by the scalability of shared-memory architectures (number of processors, memory, etc.).
  • 73. Typical Use of OpenMP Parallelizing loops using data parallelism: 1. Find the time-consuming loops (profiling). 2. Examine dependencies and the scope of the data. 3. Parallelize by inserting directives. Parallelization using task parallelism is also possible.
  • 74. Directives (1/5) OpenMP directive syntax:
Fortran (fixed form, f77): directive sentinel !$OMP, C$OMP, or *$OMP <directive>; continuation with !$OMP& on the next line; conditional compilation with !$, C$, or *$; directives start in column 1.
Fortran (free form, f90): directive sentinel !$OMP <directive>; continuation with a trailing &; conditional compilation with !$; directives may start in any column.
C: #pragma omp <directive>; continuation with a trailing backslash; conditional compilation with #ifdef _OPENMP; directives may start in any column.
  • 75. Directives (2/5) Parallel region directive: PARALLEL/END PARALLEL marks a code section as a parallel region; the region is executed simultaneously by multiple threads. Work-sharing directive: DO/FOR, used inside a parallel region; loop iterations are assigned to the threads based on the loop index. Combined parallel work-sharing directive: PARALLEL DO/FOR performs the roles of PARALLEL and DO/FOR together.
  • 76. Directives (3/5) Specifying a parallel region.
Fortran:
!$OMP PARALLEL
DO i = 1, 10
  PRINT *, 'Hello World', i
ENDDO
!$OMP END PARALLEL
C:
#pragma omp parallel
for(i=1; i<=10; i++)
  printf("Hello World %d\n", i);
  • 77. Directives (4/5) Parallel region with work sharing.
Fortran:
!$OMP PARALLEL
!$OMP DO
DO i = 1, 10
  PRINT *, 'Hello World', i
ENDDO
[!$OMP END DO]
!$OMP END PARALLEL
C:
#pragma omp parallel
{
  #pragma omp for
  for(i=1; i<=10; i++)
    printf("Hello World %d\n", i);
}
  • 78. Directives (5/5) Parallel region with multiple work-sharing constructs.
Fortran:
!$OMP PARALLEL
!$OMP DO
DO i = 1, n
  a(i) = b(i) + c(i)
ENDDO
[!$OMP END DO]   (optional)
!$OMP DO
  ...
[!$OMP END DO]
!$OMP END PARALLEL
C:
#pragma omp parallel
{
  #pragma omp for
  for (i=1; i<=n; i++) {
    a[i] = b[i] + c[i];
  }
  #pragma omp for
  for(...) {
    ...
  }
}
  • 79. Runtime Library and Environment Variables (1/3) Runtime library: omp_set_num_threads(integer) sets the number of threads; omp_get_num_threads() returns the number of threads; omp_get_thread_num() returns the thread ID. Environment variable: OMP_NUM_THREADS sets the maximum number of threads available; export OMP_NUM_THREADS=16 (ksh), setenv OMP_NUM_THREADS 16 (csh). In C, #include <omp.h>.
  • 80. Runtime Library and Environment Variables (3/3) Using omp_set_num_threads and omp_get_thread_num.
Fortran:
INTEGER OMP_GET_THREAD_NUM
CALL OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL
PRINT *, 'Thread rank: ', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL
C:
#include <omp.h>
omp_set_num_threads(4);
#pragma omp parallel
{
  printf("Thread rank: %d\n", omp_get_thread_num());
}
  • 81. Main Clauses private(var1, var2, ...), shared(var1, var2, ...), default(shared|private|none), firstprivate(var1, var2, ...), lastprivate(var1, var2, ...), reduction(operator|intrinsic: var1, var2, ...), schedule(type [,chunk])
  • 82. Clause: reduction (1/4) reduction(operator|intrinsic: var1, var2, ...). Reduction variables must be shared; arrays are allowed in Fortran only (deferred-shape and assumed-shape arrays cannot be used), while in C only scalar variables are allowed. Each thread gets a private copy, initialized according to the operator (see the tables below), and the threads compute in parallel; the partial results are then combined and the final result is returned to the master thread.
  • 83. Clause: reduction (2/4)
!$OMP DO reduction(+:sum)
DO i = 1, 100
  sum = sum + x(i)
ENDDO
With two threads this behaves as if:
Thread 0:                  Thread 1:
sum0 = 0                   sum1 = 0
DO i = 1, 50               DO i = 51, 100
  sum0 = sum0 + x(i)         sum1 = sum1 + x(i)
ENDDO                      ENDDO
and finally sum = sum0 + sum1. A complete C version of the same reduction is sketched after the operator tables below.
  • 84. Clause: reduction (3/4) Reduction operators and initial values (Fortran):
Operator   Data Types                                   Initial Value
+          integer, floating point (complex or real)    0
*          integer, floating point (complex or real)    1
-          integer, floating point (complex or real)    0
.AND.      logical                                      .TRUE.
.OR.       logical                                      .FALSE.
.EQV.      logical                                      .TRUE.
.NEQV.     logical                                      .FALSE.
MAX        integer, floating point (real only)          smallest possible value
MIN        integer, floating point (real only)          largest possible value
IAND       integer                                      all bits on
IOR        integer                                      0
IEOR       integer                                      0
  • 85. Clause: reduction (4/4) Reduction operators and initial values (C):
Operator   Data Types                Initial Value
+          integer, floating point   0
*          integer, floating point   1
-          integer, floating point   0
&          integer                   all bits on
|          integer                   0
^          integer                   0
&&         integer                   1
||         integer                   0
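The C sketch below is assumed, not taken from the slides: it is one complete version of the reduction example promised on slide 83, summing an array with reduction(+:sum). It should compile with any OpenMP-capable compiler, e.g. gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double x[100], sum = 0.0;

    for (int i = 0; i < 100; i++)      /* fill the array: x(i) = i + 1 */
        x[i] = i + 1.0;

    /* Each thread accumulates into a private copy of sum (initialized to 0),
       and the partial sums are combined when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100; i++)
        sum += x[i];

    printf("sum = %.1f (expected 5050.0), max threads = %d\n",
           sum, omp_get_max_threads());
    return 0;
}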
  • 86. Coffee break
  • 87. III. Parallel Programming using MPI
  • 88. Current HPC Platforms: COTS-Based Clusters COTS = commercial off-the-shelf. [Figure: a cluster built from COTS parts (e.g. Nehalem and Gulftown processors): login node(s) behind access control, file server(s), and compute nodes]
  • 89. Memory Architectures Shared memory: a single address space for all processors (UMA and NUMA). Distributed memory. [Figure: shared-memory (UMA, NUMA) and distributed-memory organizations]
  • 90. What is MPI? MPI = Message Passing Interface. MPI is a specification for the developers and users of message passing libraries. By itself, it is NOT a library, but rather the specification of what such a library should be. MPI primarily addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another process through cooperative operations on each process. Simply stated, the goal of the Message Passing Interface is to provide a widely used standard for writing message passing programs. The interface attempts to be: practical, portable, efficient, and flexible.
  • 91. What is MPI? The MPI standard has gone through a number of revisions, with the most recent version being MPI-3. Interface specifications have been defined for C and Fortran90 language bindings: C++ bindings from MPI-1 are removed in MPI-3; MPI-3 also provides support for Fortran 2003 and 2008 features. Actual MPI library implementations differ in which version and features of the MPI standard they support. Developers/users will need to be aware of this.
  • 92. Programming Model Originally, MPI was designed for distributed memory architectures, which were becoming increasingly popular at the time (1980s - early 1990s). As architecture trends changed, shared memory SMPs were combined over networks, creating hybrid distributed memory / shared memory systems.
  • 93. Programming Model MPI implementers adapted their libraries to handle both types of underlying memory architectures seamlessly. They also adapted/developed ways of handling different interconnects and protocols. Today, MPI runs on virtually any hardware platform: distributed memory, shared memory, and hybrid. The programming model clearly remains a distributed memory model, however, regardless of the underlying physical architecture of the machine.
  • 94. Reasons for Using MPI Standardization: MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries. Portability: there is little or no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard. Performance opportunities: vendor implementations should be able to exploit native hardware features to optimize performance. Functionality: there are over 440 routines defined in MPI-3, which includes the majority of those in MPI-2 and MPI-1. Availability: a variety of implementations are available, both vendor and public domain.
  • 95. History and Evolution MPI has resulted from the efforts of numerous individuals and groups that began in 1992. 1980s - early 1990s: distributed memory parallel computing develops, as do a number of incompatible software tools for writing such programs, usually with tradeoffs between portability, performance, functionality and price. Recognition of the need for a standard arose. Apr 1992: Workshop on Standards for Message Passing in a Distributed Memory Environment, sponsored by the Center for Research on Parallel Computing, Williamsburg, Virginia. The basic features essential to a standard message passing interface were discussed, and a working group was established to continue the standardization process. A preliminary draft proposal was developed subsequently.
  • 96. History and Evolution Nov 1992: the working group meets in Minneapolis. The MPI draft proposal (MPI1) from ORNL is presented. The group adopts procedures and organization to form the MPI Forum, which eventually comprised about 175 individuals from 40 organizations, including parallel computer vendors, software writers, academia and application scientists. Nov 1993: Supercomputing 93 conference; the draft MPI standard is presented. May 1994: the final version of MPI-1.0 is released. MPI-1.0 was followed by versions MPI-1.1 (Jun 1995), MPI-1.2 (Jul 1997) and MPI-1.3 (May 2008). MPI-2 picked up where the first MPI specification left off and addressed topics which went far beyond the MPI-1 specification; it was finalized in 1996. MPI-2.1 (Sep 2008) and MPI-2.2 (Sep 2009) followed. Sep 2012: the MPI-3.0 standard was approved.
  • 97. History and Evolution Documentation for all versions of the MPI standard is available at: http://www.mpi-forum.org/docs/
  • 98. A General Structure of the MPI Program [Figure: general structure of an MPI program]
  • 99. A Header File for MPI Routines Required for all programs that make MPI library calls. C include file: #include "mpi.h". Fortran include file: include 'mpif.h'. With MPI-3 Fortran, the USE mpi_f08 module is preferred over using the include file shown above.
  • 100. The Format of MPI Calls C names are case sensitive; Fortran names are not. Programs must not declare variables or functions with names beginning with the prefix MPI_ or PMPI_ (profiling interface).
C binding format: rc = MPI_Xxxxx(parameter, ...), for example rc = MPI_Bsend(&buf, count, type, dest, tag, comm); the error code is returned as rc, with MPI_SUCCESS if successful.
Fortran binding format: CALL MPI_XXXXX(parameter, ..., ierr) or call mpi_xxxxx(parameter, ..., ierr), for example call MPI_BSEND(buf, count, type, dest, tag, comm, ierr); the error code is returned in the ierr parameter, with MPI_SUCCESS if successful.
  • 101. Communicators and Groups MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Most MPI routines require you to specify a communicator as an argument. Communicators and groups will be covered in more detail later. For now, simply use MPI_COMM_WORLD whenever a communicator is required; it is the predefined communicator that includes all of your MPI processes.
  • 102. Rank Within a communicator, every process has its own unique, integer identifier assigned by the system when the process initializes. A rank is sometimes also called a "task ID". Ranks are contiguous and begin at zero. Used by the programmer to specify the source and destination of messages. Often used conditionally by the application to control program execution (if rank = 0 do this / if rank = 1 do that).
  • 103. Error Handling Most MPI routines include a return/error code parameter, as described in the "Format of MPI Calls" section above. However, according to the MPI standard, the default behavior of an MPI call is to abort if there is an error. This means you will probably not be able to capture a return/error code other than MPI_SUCCESS (zero). The standard does provide a means to override this default error handler. You can also consult the error handling section of the MPI standard, located at http://www.mpi-forum.org/docs/mpi-11-html/node148.html. The types of errors displayed to the user are implementation dependent.
  • 104. Environment Management Routines MPI_Init: Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI function, and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent.
C: MPI_Init(&argc, &argv)
Fortran: MPI_INIT(ierr)
Input parameters: argc, pointer to the number of arguments; argv, pointer to the argument vector. ierr: the error return argument.
  • 105. Environment Management Routines MPI_Comm_size: Returns the total number of MPI processes in the specified communicator, such as MPI_COMM_WORLD. If the communicator is MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application.
C: MPI_Comm_size(comm, &size)
Fortran: MPI_COMM_SIZE(comm, size, ierr)
Input parameters: comm, communicator (handle). Output parameters: size, number of processes in the group of comm (integer). ierr: the error return argument.
  • 106. Environment Management Routines MPI_Comm_rank: Returns the rank of the calling MPI process within the specified communicator. Initially, each process will be assigned a unique integer rank between 0 and (number of tasks - 1) within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.
C: MPI_Comm_rank(comm, &rank)
Fortran: MPI_COMM_RANK(comm, rank, ierr)
Input parameters: comm, communicator (handle). Output parameters: rank, rank of the calling process in the group of comm (integer). ierr: the error return argument.
  • 107. Environment Management Routines MPI_Finalize: Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program; no other MPI routines may be called after it.
C: MPI_Finalize()
Fortran: MPI_FINALIZE(ierr)
ierr: the error return argument.
  • 108. Environment Management Routines MPI_Abort: Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes regardless of the communicator specified.
C: MPI_Abort(comm, errorcode)
Fortran: MPI_ABORT(comm, errorcode, ierr)
Input parameters: comm, communicator (handle); errorcode, error code to return to the invoking environment. ierr: the error return argument.
  • 109. Environment Management Routines MPI_Get_processor_name: Returns the processor name, and also the length of the name. The buffer for "name" must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into "name" is implementation dependent; it may not be the same as the output of the "hostname" or "host" shell commands.
C: MPI_Get_processor_name(&name, &resultlength)
Fortran: MPI_GET_PROCESSOR_NAME(name, resultlength, ierr)
Output parameters: name, a unique specifier for the actual (as opposed to virtual) node; this must be an array of size at least MPI_MAX_PROCESSOR_NAME. resultlen: length (in characters) of the name. ierr: the error return argument.
  • 110. Environment Management Routines MPI_Get_version: Returns the version (either 1 or 2) and subversion of MPI.
C: MPI_Get_version(&version, &subversion)
Fortran: MPI_GET_VERSION(version, subversion, ierr)
Output parameters: version, major version of MPI (1 or 2); subversion, minor version of MPI. ierr: the error return argument.
  • 111. Environment Management Routines MPI_Initialized: Indicates whether MPI_Init has been called; returns flag as either logical true (1) or false (0).
C: MPI_Initialized(&flag)
Fortran: MPI_INITIALIZED(flag, ierr)
Output parameters: flag, true if MPI_Init has been called and false otherwise. ierr: the error return argument.
  • 112. Environment Management Routines MPI_Wtime: Returns an elapsed wall clock time in seconds (double precision) on the calling processor.
C: MPI_Wtime()
Fortran: MPI_WTIME()
Return value: time in seconds since an arbitrary time in the past.
MPI_Wtick: Returns the resolution in seconds (double precision) of MPI_Wtime.
C: MPI_Wtick()
Fortran: MPI_WTICK()
Return value: time in seconds of the resolution of MPI_Wtime.
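As a brief usage sketch (not part of the original slides), MPI_Wtime can bracket a region of interest; the difference of two calls gives elapsed wall-clock seconds on the calling process. The loop being timed here is an arbitrary placeholder workload.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();          /* start of the timed region */

    double s = 0.0;                   /* some local work to time */
    for (long i = 0; i < 10000000; i++)
        s += 1.0 / (i + 1.0);

    double t1 = MPI_Wtime();          /* end of the timed region */
    printf("elapsed = %f s (timer resolution %g s), s = %f\n",
           t1 - t0, MPI_Wtick(), s);

    MPI_Finalize();
    return 0;
}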
  • 113. Example: Hello World
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rc;
    rc = MPI_Init(&argc, &argv);
    printf("Hello world.\n");
    rc = MPI_Finalize();
    return 0;
}
  • 114. Example: Hello World Executing an MPI program:
$ module load [compiler] [mpi]
$ mpicc hello.c
$ mpirun -np 4 -hostfile [hostfile] ./a.out
Creating a hostfile:
ibs0001 slots=2
ibs0002 slots=2
ibs0003 slots=2
ibs0003 slots=2
...
  • 115. Example: Environment Management Routines
#include "mpi.h"
#include <stdio.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, len, rc;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    rc = MPI_Init(&argc,&argv);
    if (rc != MPI_SUCCESS) {
        printf("Error starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Get_processor_name(hostname, &len);
    printf("Number of tasks= %d My rank= %d Running on %s\n", numtasks, rank, hostname);

    /******* do some work *******/

    rc = MPI_Finalize();
    return 0;
}
  • 116. Types of Point-to-Point Operations MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks. One task performs a send operation and the other task performs a matching receive operation. There are different types of send and receive routines used for different purposes: synchronous send; blocking send / blocking receive; non-blocking send / non-blocking receive; buffered send; combined send/receive; "ready" send. Any type of send routine can be paired with any type of receive routine. MPI also provides several routines associated with send and receive operations, such as those used to wait for a message's arrival or probe to find out whether a message has arrived.
  • 117. Buffering In a perfect world, every send operation would be perfectly synchronized with its matching receive. This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync. Consider the following two cases: (1) a send operation occurs 5 seconds before the receive is ready; where is the message while the receive is pending? (2) multiple sends arrive at the same receiving task, which can only accept one send at a time; what happens to the messages that are "backing up"?
  • 118. Buffering The MPI implementation (not the MPI standard) decides what happens to data in these types of cases. Typically, a system buffer area is reserved to hold data in transit.
  • 119. Buffering System buffer space is: opaque to the programmer and managed entirely by the MPI library; a finite resource that can be easy to exhaust; often mysterious and not well documented; able to exist on the sending side, the receiving side, or both; something that may improve program performance because it allows send-receive operations to be asynchronous.
  • 120. Blocking vs. Non-blocking Most of the MPI point-to-point routines can be used in either blocking or non-blocking mode.
Blocking: A blocking send routine will only "return" after it is safe to modify the application buffer (your send data) for reuse. Safe means that modifications will not affect the data intended for the receive task. Safe does not imply that the data was actually received; it may very well be sitting in a system buffer. A blocking send can be synchronous, which means there is handshaking with the receive task to confirm a safe send. A blocking send can be asynchronous if a system buffer is used to hold the data for eventual delivery to the receive. A blocking receive only "returns" after the data has arrived and is ready for use by the program.
Non-blocking: Non-blocking send and receive routines behave similarly; they return almost immediately. They do not wait for any communication events to complete, such as message copying from user memory to system buffer space or the actual arrival of the message. Non-blocking operations simply "request" the MPI library to perform the operation when it is able; the user cannot predict when that will happen. It is unsafe to modify the application buffer (your variable space) until you know for a fact that the requested non-blocking operation was actually performed by the library. There are "wait" routines used to do this. Non-blocking communications are primarily used to overlap computation with communication and exploit possible performance gains.
  • 121. MPI Message Passing Routine Arguments MPI point-to-point communication routines generally have an argument list that takes one of the following formats:
Blocking send: MPI_Send(buffer, count, type, dest, tag, comm)
Non-blocking send: MPI_Isend(buffer, count, type, dest, tag, comm, request)
Blocking receive: MPI_Recv(buffer, count, type, source, tag, comm, status)
Non-blocking receive: MPI_Irecv(buffer, count, type, source, tag, comm, request)
Buffer: program (application) address space that references the data to be sent or received. In most cases, this is simply the variable name that is to be sent/received. For C programs, this argument is passed by reference and usually must be prepended with an ampersand: &var1.
Data count: indicates the number of data elements of a particular type to be sent.
  • 122. MPI Message Passing Routine Arguments Data type: for reasons of portability, MPI predefines its elementary data types. The table below lists the C types required by the standard.
MPI_CHAR            signed char
MPI_SHORT           signed short int
MPI_INT             signed int
MPI_LONG            signed long int
MPI_SIGNED_CHAR     signed char
MPI_UNSIGNED_CHAR   unsigned char
MPI_UNSIGNED_SHORT  unsigned short int
MPI_UNSIGNED        unsigned int
MPI_UNSIGNED_LONG   unsigned long int
MPI_FLOAT           float
MPI_DOUBLE          double
MPI_LONG_DOUBLE     long double
  • 123. MPI Message Passing Routine Arguments Destination: an argument to send routines that indicates the process where a message should be delivered. Specified as the rank of the receiving process.
Tag: an arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wild card MPI_ANY_TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers 0 - 32767 can be used as tags, but most implementations allow a much larger range than this.
Communicator: indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating a new communicator, the predefined communicator MPI_COMM_WORLD is usually used.
  • 124. MPI Message Passing Routine Arguments Status: for a receive operation, indicates the source of the message and the tag of the message. In C, this argument is a pointer to the predefined structure MPI_Status (e.g. stat.MPI_SOURCE, stat.MPI_TAG). In Fortran, it is an integer array of size MPI_STATUS_SIZE (e.g. stat(MPI_SOURCE), stat(MPI_TAG)). Additionally, the actual number of bytes received is obtainable from Status via the MPI_Get_count routine.
Request: used by non-blocking send and receive operations. Since non-blocking operations may return before the requested system buffer space is obtained, the system issues a unique "request number". The programmer uses this system-assigned "handle" later (in a WAIT type routine) to determine completion of the non-blocking operation. In C, this argument is a pointer to the predefined structure MPI_Request. In Fortran, it is an integer.
  • 125. Example: Blocking Message Passing Routines (1/2)
#include "mpi.h"
#include <stdio.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, dest, source, rc, count, tag=1;
    char inmsg, outmsg='x';
    MPI_Status Stat;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1; source = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = 0; source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }
  • 126. Example: Blocking Message Passing Routines (2/2)
    rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
    printf("Task %d: Received %d char(s) from task %d with tag %d\n",
           rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);

    MPI_Finalize();
    return 0;
}
  • 127. Example: Deadlock
#include "mpi.h"
#include <stdio.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, dest, source, rc, count, tag=1;
    char inmsg, outmsg='x';
    MPI_Status Stat;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1; source = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        dest = 0; source = 0;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
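One way to remove the deadlock above (a sketch, not part of the course code) is to reverse the send/receive order on one of the ranks, as the blocking example on slide 125 does, or to let MPI pair the operations with MPI_Sendrecv, as below. The program assumes it is run with exactly two tasks.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, other, tag = 1;
    char inmsg, outmsg = 'x';
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other = (rank == 0) ? 1 : 0;      /* partner rank; run with exactly 2 tasks */

    /* MPI_Sendrecv posts the send and the receive together, so neither rank
       blocks waiting for the other to send first. */
    MPI_Sendrecv(&outmsg, 1, MPI_CHAR, other, tag,
                 &inmsg,  1, MPI_CHAR, other, tag,
                 MPI_COMM_WORLD, &stat);

    printf("Task %d received '%c' from task %d\n", rank, inmsg, other);

    MPI_Finalize();
    return 0;
}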
  • 128. Example: Non-Blocking Message Passing Routines (1/2) Nearest-neighbor exchange in a ring topology.
#include "mpi.h"
#include <stdio.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2;
    MPI_Request reqs[4];
    MPI_Status stats[4];

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    prev = rank-1;
    next = rank+1;
    if (rank == 0) prev = numtasks - 1;
    if (rank == (numtasks - 1)) next = 0;
  • 129. Example: Non-Blocking Message Passing Routines (2/2)
    MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);

    /* do some work */

    MPI_Waitall(4, reqs, stats);

    MPI_Finalize();
    return 0;
}
  • 130. Advanced Example: Monte-Carlo Simulation <Problem> Monte-Carlo simulation using random numbers: PI = 4 x Ac/As, where Ac is the area of the quarter circle and As the area of the unit square, estimated by counting random points. <Requirement> Use N processes (ranks) and point-to-point communication. [Figure: a quarter circle of radius r inscribed in a square]
  • 131. Advanced Example: Monte-Carlo Simulation for PI
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0; cnt = 0; r = 0.0;
    for (i=0; i<num_step; i++) {
        x = rand() / (RAND_MAX+1.0);
        y = rand() / (RAND_MAX+1.0);
        r = sqrt(x*x + y*y);
        if (r<=1) cnt += 1;
    }
    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
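A possible MPI parallelization of this Monte-Carlo code using only point-to-point communication, as the requirement asks. This is one sketch among many: the per-rank seeding, the even split of num_step, and gathering the partial counts on rank 0 with MPI_Send/MPI_Recv are choices of this sketch, not prescribed by the slides.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i, cnt = 0, local_steps, total;
    double pi, x, y;
    int rank, nprocs, tag = 0;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(rank + 1);                         /* different stream per rank (simplistic) */
    local_steps = num_step / nprocs;         /* assumes num_step divisible by nprocs */

    for (i = 0; i < local_steps; i++) {
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        if (sqrt(x * x + y * y) <= 1.0) cnt++;
    }

    if (rank == 0) {                         /* rank 0 collects the partial counts */
        total = cnt;
        for (int src = 1; src < nprocs; src++) {
            long part;
            MPI_Recv(&part, 1, MPI_LONG, src, tag, MPI_COMM_WORLD, &stat);
            total += part;
        }
        pi = 4.0 * (double)total / (double)(local_steps * nprocs);
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    } else {
        MPI_Send(&cnt, 1, MPI_LONG, 0, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}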
  • 132. Advanced Example: Numerical Integration for PI <Problem> Compute PI by numerical integration: the integral from 0 to 1 of 4.0/(1+x^2) dx equals PI, approximated with the midpoint rule as (1/n) * sum over i=1..n of 4/(1 + ((i-0.5)/n)^2), where the sample points are x_i = (i-0.5)/n. <Requirement> Use point-to-point communication. [Figure: rectangles of height f(x_1), f(x_2), ..., f(x_n) and width 1/n approximating the area under the curve]
  • 133. Advanced Example: Numerical Integration for PI
#include <stdio.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0/(double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");
    for (i=0; i<num_step; i++) {
        x = ((double)i + 0.5) * step;
        sum += 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
  • 134. Types of Collective Operations Synchronization: processes wait until all members of the group have reached the synchronization point. Data movement: broadcast, scatter/gather, all-to-all. Collective computation (reductions): one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
  • 135. Programming Considerations and Restrictions With MPI-3, collective operations can be blocking or non-blocking. Only blocking operations are covered in this tutorial. Collective communication routines do not take message tag arguments. Collective operations within a subset of processes are accomplished by first partitioning the subset into new groups and then attaching the new groups to new communicators. They can only be used with MPI predefined datatypes, not with MPI derived data types. MPI-2 extended most collective operations to allow data movement between intercommunicators (not covered here).
  • 136. Collective Communication Routines MPI_Barrier: synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.
C: MPI_Barrier(comm)
Fortran: MPI_BARRIER(comm, ierr)
  • 137. Collective Communication Routines MPI_Bcast: data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
C: MPI_Bcast(&buffer, count, datatype, root, comm)
Fortran: MPI_BCAST(buffer, count, datatype, root, comm, ierr)
  • 138. Collective Communication Routines MPI_Scatter: data movement operation. Distributes distinct messages from a single source task to each task in the group.
C: MPI_Scatter(&sendbuf, sendcnt, sendtype, &recvbuf, recvcnt, recvtype, root, comm)
Fortran: MPI_SCATTER(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm, ierr)
  • 139. Collective Communication Routines MPI_Gather: data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.
C: MPI_Gather(&sendbuf, sendcnt, sendtype, &recvbuf, recvcount, recvtype, root, comm)
Fortran: MPI_GATHER(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm, ierr)
  • 140. Collective Communication Routines MPI_Allgather: data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
C: MPI_Allgather(&sendbuf, sendcount, sendtype, &recvbuf, recvcount, recvtype, comm)
Fortran: MPI_ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
  • 141. Collective Communication Routines MPI_Reduce: collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.
C: MPI_Reduce(&sendbuf, &recvbuf, count, datatype, op, root, comm)
Fortran: MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
  • 142. Collective Communication Routines The predefined MPI reduction operations appear below. Users can also define their own reduction functions by using the MPI_Op_create routine.
MPI Reduction Operation   Meaning                   C Data Types
MPI_MAX                   maximum                   integer, float
MPI_MIN                   minimum                   integer, float
MPI_SUM                   sum                       integer, float
MPI_PROD                  product                   integer, float
MPI_LAND                  logical AND               integer
MPI_BAND                  bit-wise AND              integer, MPI_BYTE
MPI_LOR                   logical OR                integer
MPI_BOR                   bit-wise OR               integer, MPI_BYTE
MPI_LXOR                  logical XOR               integer
MPI_BXOR                  bit-wise XOR              integer, MPI_BYTE
MPI_MAXLOC                max value and location    float, double and long double
MPI_MINLOC                min value and location    float, double and long double
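A short usage sketch (not from the slides): each rank contributes its rank number and rank 0 receives the sum via MPI_Reduce with MPI_SUM.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, nprocs, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Sum all rank numbers; only the root (rank 0) receives the result. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", nprocs - 1, sum);

    MPI_Finalize();
    return 0;
}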
  • 143. Collective Communication Routines MPI_Allreduce: collective computation operation plus data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
C: MPI_Allreduce(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran: MPI_ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm, ierr)
  • 144. Collective Communication Routines MPI_Reduce_scatter: collective computation operation plus data movement. First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.
C: MPI_Reduce_scatter(&sendbuf, &recvbuf, recvcount, datatype, op, comm)
Fortran: MPI_REDUCE_SCATTER(sendbuf, recvbuf, recvcount, datatype, op, comm, ierr)
  • 145. Collective Communication Routines MPI_Alltoall: data movement operation. Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.
C: MPI_Alltoall(&sendbuf, sendcount, sendtype, &recvbuf, recvcnt, recvtype, comm)
Fortran: MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm, ierr)
  • 146. Collective Communication Routines MPI_Scan: performs a scan operation with respect to a reduction operation across a task group.
C: MPI_Scan(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran: MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm, ierr)
  • 147. Collective Communication Routines [Figure: data layouts across tasks P0-P3 before and after broadcast, scatter, gather, allgather, and alltoall, and the results of reduce, allreduce, scan, and reduce_scatter, where * denotes some reduction operator]
  • 148. Example: Collective Communication (1/2) Perform a scatter operation on the rows of an array.
#include "mpi.h"
#include <stdio.h>
#define SIZE 4

int main(argc,argv)
int argc;
char *argv[];
{
    int numtasks, rank, sendcount, recvcount, source;
    float sendbuf[SIZE][SIZE] = {
        {1.0, 2.0, 3.0, 4.0},
        {5.0, 6.0, 7.0, 8.0},
        {9.0, 10.0, 11.0, 12.0},
        {13.0, 14.0, 15.0, 16.0} };
    float recvbuf[SIZE];

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  • 149. Example: Collective Communication (2/2)
    if (numtasks == SIZE) {
        source = 1;
        sendcount = SIZE;
        recvcount = SIZE;
        MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf, recvcount,
                    MPI_FLOAT, source, MPI_COMM_WORLD);
        printf("rank= %d Results: %f %f %f %f\n", rank, recvbuf[0],
               recvbuf[1], recvbuf[2], recvbuf[3]);
    }
    else
        printf("Must specify %d processors. Terminating.\n", SIZE);

    MPI_Finalize();
    return 0;
}
  • 150. Advanced Example: Monte-Carlo Simulation for PI Use the collective communication routines!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0; cnt = 0; r = 0.0;
    for (i=0; i<num_step; i++) {
        x = rand() / (RAND_MAX+1.0);
        y = rand() / (RAND_MAX+1.0);
        r = sqrt(x*x + y*y);
        if (r<=1) cnt += 1;
    }
    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
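One possible collective-communication version of this Monte-Carlo exercise (a sketch, not the official course solution): the structure mirrors the point-to-point sketch after slide 131, but the partial counts are combined with a single MPI_Reduce. The per-rank seeding and the even split of num_step are assumptions of this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i, cnt = 0, total = 0, local_steps;
    double pi, x, y;
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(rank + 1);                          /* simplistic per-rank seeding */
    local_steps = num_step / nprocs;          /* assumes divisibility */

    for (i = 0; i < local_steps; i++) {
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        if (sqrt(x * x + y * y) <= 1.0) cnt++;
    }

    /* Combine all local counts into total on rank 0 with one collective call. */
    MPI_Reduce(&cnt, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = 4.0 * (double)total / (double)(local_steps * nprocs);
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }

    MPI_Finalize();
    return 0;
}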
  • 151. Advanced Example: Numerical Integration for PI Use the collective communication routines!
#include <stdio.h>
#include <math.h>

int main()
{
    const long num_step=100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0/(double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");
    for (i=0; i<num_step; i++) {
        x = ((double)i + 0.5) * step;
        sum += 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
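And a corresponding collective version of the numerical-integration exercise (again a sketch, not the official solution): each rank integrates a cyclically strided subset of the intervals and the partial sums are combined with MPI_Reduce; the cyclic distribution is a choice of this sketch.

#include <stdio.h>
#include <math.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i;
    double sum = 0.0, total = 0.0, step, pi, x;
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    step = 1.0 / (double)num_step;

    /* Cyclic distribution of the intervals: rank r takes i = r, r+nprocs, ... */
    for (i = rank; i < num_step; i += nprocs) {
        x = ((double)i + 0.5) * step;         /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }

    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = step * total;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }

    MPI_Finalize();
    return 0;
}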
  • 152. Any questions?