In this report, we propose Static Code Generation Auto-tuning (SCG-AT), a software construction scheme for auto-tuning (AT) that performs no dynamic code generation or compilation during code optimization and uses only code generated statically before execution. To evaluate AT with SCG-AT, we implemented "hierarchical AT processing". In ppOpen-APPL/FDM, a seismic-wave simulation based on the finite-difference method, we implemented code selection between the conventional code written for vector machines and newly developed code for scalar machines. We evaluated the code-selection AT of SCG-AT on three entirely different computers: the Xeon Phi, the Ivy Bridge, and the FX10. The evaluation showed that, on the Xeon Phi and the Ivy Bridge, selecting the scalar-machine code achieves speedups that the conventional AT scheme cannot reach.
------
4. Background
• High-Thread Parallelism (HTP)
  – Multi-core and many-core processors are pervasive.
    • Multicore CPUs: 16-32 cores, 32-64 threads with Hyper-Threading (HT) or Simultaneous Multithreading (SMT)
    • Many-core CPU: Xeon Phi – 60 cores, 240 threads with HT.
  – Utilizing parallelism with full threads is important.
Performance Portability (PP)
◦ Keeping high performance in multiple computer environments.
  Not only multiple CPUs, but also multiple compilers.
  Run-time information, such as loop length and the number of threads, is important.
◦ Auto-tuning (AT) is one of the candidate technologies for establishing PP in multiple computer environments.
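As a concrete, purely illustrative example of how such run-time information can drive a choice between statically prepared code variants, the following minimal Fortran sketch switches between a parallel and a sequential variant of a hypothetical kernel based on the loop length and the OpenMP thread count; the kernel, array names, and the threshold are assumptions and are not taken from ppOpen-AT.

! Minimal sketch: pick one of two statically generated variants of a kernel
! using run-time information (loop length n and OpenMP thread count).
! The kernel, the arrays, and the switching threshold are hypothetical.
! Compile with OpenMP enabled (e.g., -fopenmp or -qopenmp).
program at_runtime_selection
  use omp_lib
  implicit none
  integer, parameter :: n = 100000
  integer :: nthreads, i
  real(8) :: a(n), b(n)

  a = 1.0d0
  nthreads = omp_get_max_threads()

  if (n / nthreads >= 1000) then
     ! Variant 1: enough work per thread -> simple parallel loop
     !$omp parallel do
     do i = 1, n
        b(i) = 2.0d0 * a(i)
     end do
     !$omp end parallel do
  else
     ! Variant 2: too little work per thread -> run sequentially
     do i = 1, n
        b(i) = 2.0d0 * a(i)
     end do
  end if

  print *, 'threads =', nthreads, '  b(1) =', b(1)
end program at_runtime_selection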
5. ppOpen-HPC Project
• Middleware for HPC and Its AT Technology
  – Supported by JST CREST, from FY2011 to FY2015.
  – PI: Professor Kengo Nakajima (U. Tokyo)
• ppOpen-HPC
  – An open-source infrastructure for reliable simulation codes on post-peta (pp) scale parallel computers.
  – Consists of various types of libraries, which cover 5 kinds of discretization methods for scientific computations.
• ppOpen-AT
  – An auto-tuning language for ppOpen-HPC codes.
  – Uses knowledge from a previous project, the ABCLibScript Project.
  – An auto-tuning language based on directives for AT.
19. Implementation for Scalar Machines (Intel-oriented)
!$omp parallel do private(i,j,k,RL1,RM1,RM2,RLRM2,DXVX, …)
do k_j = 1, (NZ01-NZ00+1)*(NY01-NY00+1)
  k = (k_j-1)/(NY01-NY00+1) + NZ00
  j = mod((k_j-1),(NY01-NY00+1)) + NY00
  do i = NX00, NX01
    RL1 = LAM(I,J,K); RM1 = RIG(I,J,K)
    RM2 = RM1 + RM1;  RLRM2 = RL1 + RM2
    ! 4th order diff
    DXVX0 = (VX(I,J,K)  - VX(I-1,J,K))*C40/dx &
          - (VX(I+1,J,K) - VX(I-2,J,K))*C41/dx
    DYVY0 = (VY(I,J,K)  - VY(I,J-1,K))*C40/dy &
          - (VY(I,J+1,K) - VY(I,J-2,K))*C41/dy
    DZVZ0 = (VZ(I,J,K)  - VZ(I,J,K-1))*C40/dz &
          - (VZ(I,J,K+1) - VZ(I,J,K-2))*C41/dz
    ! truncate_diff_vel
    ! X dir
    if (idx == 0) then
      if (i == 1) then
        DXVX0 = ( VX(1,J,K) - 0.0_PN ) / DX
      end if
      if (i == 2) then
        DXVX0 = ( VX(2,J,K) - VX(1,J,K) ) / DX
      end if
    end if
    if (idx == IP-1) then
      if (i == NXP) then
        DXVX0 = ( VX(NXP,J,K) - VX(NXP-1,J,K) ) / DX
      end if
    end if
    ! Y dir
    if (idy == 0) then   ! Shallowmost
      if (j == 1) then
        DYVY0 = ( VY(I,1,K) - 0.0_PN ) / DY
      end if
      if (j == 2) then
        DYVY0 = ( VY(I,2,K) - VY(I,1,K) ) / DY
      end if
    end if
    if (idy == JP-1) then
      if (j == NYP) then
        DYVY0 = ( VY(I,NYP,K) - VY(I,NYP-1,K) ) / DY
      end if
    end if
    ! Z dir
    if (idz == 0) then   ! Shallowmost
      if (k == 1) then
        DZVZ0 = ( VZ(I,J,1) - 0.0_PN ) / DZ
      end if
      if (k == 2) then
        DZVZ0 = ( VZ(I,J,2) - VZ(I,J,1) ) / DZ
      end if
    end if
    if (idz == KP-1) then
      if (k == NZP) then
        DZVZ0 = ( VZ(I,J,NZP) - VZ(I,J,NZP-1) ) / DZ
      end if
    end if
Handling of model boundaries
Evaluation of spatial derivatives by the central difference approximation
    DXVX1 = DXVX0; DYVY1 = DYVY0
    DZVZ1 = DZVZ0; D3V3 = DXVX1 + DYVY1 + DZVZ1
    SXX(I,J,K) = SXX(I,J,K) &
      + ( RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1) ) * DT
    SYY(I,J,K) = SYY(I,J,K) &
      + ( RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1) ) * DT
    SZZ(I,J,K) = SZZ(I,J,K) &
      + ( RLRM2*(D3V3) - RM2*(DXVX1+DYVY1) ) * DT
  end do
end do
!$omp end parallel do
Explicit time integration by the leapfrog method
The B/F (bytes-per-flop) value becomes small
IF statements inside the loop inhibit compiler optimization
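The hand-collapsed k_j loop above recovers k and j with an integer division and a modulo. The following stand-alone check (not part of ppOpen-APPL/FDM; the loop bounds are arbitrary illustration values) prints the mapping so the index arithmetic can be verified:

! Stand-alone check of the index recovery used in the hand-collapsed loop.
! NY00/NY01/NZ00/NZ01 are arbitrary illustration values, not the real bounds.
program check_loop_collapse
  implicit none
  integer, parameter :: NY00 = 1, NY01 = 4, NZ00 = 2, NZ01 = 4
  integer :: k_j, k, j

  do k_j = 1, (NZ01-NZ00+1)*(NY01-NY00+1)
     k = (k_j-1)/(NY01-NY00+1) + NZ00      ! outer (z) index
     j = mod(k_j-1, NY01-NY00+1) + NY00    ! inner (y) index
     print '(a,i3,a,i3,a,i3)', ' k_j =', k_j, '   k =', k, '   j =', j
  end do
end program check_loop_collapse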
28. Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• The number of time steps: 2000
• The number of nodes: 8
• Native mode execution
• Target Problem Size (almost the maximum size with 8 GB/node)
  – NX * NY * NZ = 1536 x 768 x 240 / 8 nodes
  – NX * NY * NZ = 768 x 384 x 120 per node (not per MPI process)
• The number of iterations for the kernels to do auto-tuning: 100
29. Candidates in Conventional AT
1. Kernel update_stress
   – 8 Kinds of Candidates with Loop Collapse and Loop Split.
2. Kernel update_vel
   – 6 Kinds of Candidates with Loop Collapse and Re-ordering of Statements.
3 Kinds of Candidates with Loop Collapse:
3. Kernel update_stress_sponge
4. Kernel update_vel_sponge
5. Kernel ppohFDM_pdiffx3_p4
6. Kernel ppohFDM_pdiffx3_m4
7. Kernel ppohFDM_pdiffy3_p4
8. Kernel ppohFDM_pdiffy3_m4
9. Kernel ppohFDM_pdiffz3_p4
10. Kernel ppohFDM_pdiffz3_m4
Kinds of Candidates with Loop Collapse for Data Packing and Data Unpacking:
11. Kernel ppohFDM_ps_pack
12. Kernel ppohFDM_ps_unpack
13. Kernel ppohFDM_pv_pack
14. Kernel ppohFDM_pv_unpack
Total number of kernel candidates: 47 + 2 (code selection) = 49
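The two extra candidates labelled "code selection" correspond to choosing, before the production time steps, between the statically generated vector-machine code and the scalar-machine code. A minimal sketch of that SCG-AT style selection is shown below; the subroutine names, array sizes, and the trivial kernel bodies are hypothetical stand-ins, not the code actually generated by ppOpen-AT.

! Sketch of SCG-AT style code selection: both variants are generated and
! compiled before execution; at AT time each is measured and the faster one
! is used for the remaining time steps. All names and sizes are hypothetical.
module kernels
  implicit none
contains
  subroutine update_vector_style(a, n)   ! stand-in for the vector-machine code
    real(8), intent(inout) :: a(:)
    integer, intent(in) :: n
    integer :: i
    do i = 1, n
       a(i) = a(i) + 1.0d0
    end do
  end subroutine

  subroutine update_scalar_style(a, n)   ! stand-in for the scalar-machine code
    real(8), intent(inout) :: a(:)
    integer, intent(in) :: n
    integer :: i
    do i = 1, n
       a(i) = a(i) + 1.0d0
    end do
  end subroutine
end module kernels

program scg_at_code_selection
  use kernels
  implicit none
  integer, parameter :: n = 1000000, trials = 100   ! 100 AT iterations, as above
  real(8), allocatable :: a(:)
  real(8) :: t0, t1, t_vec, t_sca
  integer :: it, best

  allocate(a(n)); a = 0.0d0

  call cpu_time(t0)
  do it = 1, trials
     call update_vector_style(a, n)
  end do
  call cpu_time(t1); t_vec = t1 - t0

  call cpu_time(t0)
  do it = 1, trials
     call update_scalar_style(a, n)
  end do
  call cpu_time(t1); t_sca = t1 - t0

  best = merge(1, 2, t_vec <= t_sca)
  print *, 'selected candidate:', best, ' (1 = vector-style, 2 = scalar-style)'
  ! The production run would now call only the selected variant.
end program scg_at_code_selection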
34. Machine Environment (8 nodes of the Xeon Phi)
• The Intel Xeon Phi
  – Xeon Phi 5110P (1.053 GHz), 60 cores
  – Memory amount: 8 GB (GDDR5)
  – Theoretical peak performance: 1.01 TFLOPS
  – One board per node of the Xeon Phi cluster
• InfiniBand FDR x 2 ports
  – Mellanox Connect-IB, PCI-E Gen3 x16, 56 Gbps x 2
  – Theoretical peak bandwidth: 13.6 GB/s
  – Full bisection
• Intel MPI
  – Based on MPICH2, MVAPICH2
  – Version 5.0 Update 3, Build 20150128
• Compiler: Intel Fortran version 15.0.3 20150407
  – Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel -mmic -align array64byte
  – KMP_AFFINITY=granularity=fine,balanced (uniform distribution of threads between sockets)
35. Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• The number of time steps: 2000
• The number of nodes: 8
• Native mode execution
• Target Problem Size (almost the maximum size with 8 GB/node)
  – NX * NY * NZ = 1536 x 768 x 240 / 8 nodes
  – NX * NY * NZ = 768 x 384 x 120 per node (not per MPI process)
• The number of iterations for the kernels to do auto-tuning: 100
36. [Figure: Whole execution time for 2000 time steps, in seconds, on the Xeon Phi (8 nodes, 1920 threads, hybrid MPI/OpenMP, NX*NY*NZ = 1536 x 768 x 240 / 8 nodes). Bars for P8T240, P16T120, P32T60, P64T30, P128T15, P240T8, and P480T4 show the breakdown into update_stress, update_stress_sponge, update_vel, update_vel_sponge, passing_velocity, passing_stress, IO, and Others (computation vs. communication time), comparing the Original Code, AT without code selection, and Full AT.]
43. Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• The number of time steps: 2000
• The number of nodes: 8
• Target Problem Size (almost the maximum size with 8 GB/node)
  – NX * NY * NZ = 1536 x 768 x 240 / 8 nodes
  – NX * NY * NZ = 768 x 384 x 120 per node (not per MPI process)
• Target MPI processes and threads on the Ivy Bridge
  – Each Ivy Bridge node has 2-way Hyper-Threading (HT).
  – PXTY: X MPI processes and Y threads per process
  – P8T20: minimum hybrid MPI-OpenMP execution for ppOpen-APPL/FDM, since it requires at least 8 MPI processes.
  – P16T10
  – P32T5
  – P40T4
  – P80T2
  – P160T1: pure MPI execution with 8 nodes
• The number of iterations for the kernels: 100
45. [Figure: Whole execution time for 2000 time steps, in seconds, on the Ivy Bridge (8 nodes). Bars for P8T20, P16T10, P32T5, P40T4, P80T2, and P160T1 show the breakdown into update_stress, update_stress_sponge, update_vel, update_vel_sponge, passing_velocity, passing_stress, IO, and Others (computation vs. communication time), comparing the Original Code, AT without code selection, and Full AT.]
52. Oakleaf-FX (ITC, U. Tokyo), the Fujitsu PRIMEHPC FX10
• Whole system (4,800 nodes, 76,800 cores)
  – Total performance: 1.135 PFLOPS
  – Total memory amount: 150 TB
  – Total #nodes: 4,800
  – Interconnection: the Tofu (6-dimensional mesh/torus)
  – Local file system amount: 1.1 PB
  – Shared file system amount: 2.1 PB
• Node
  – Theoretical peak performance: 236.5 GFLOPS
  – #Processors (#cores): 16
  – Main memory amount: 32 GB
• Processor
  – Processor name: SPARC64 IXfx
  – Frequency: 1.848 GHz
  – Theoretical peak performance (core): 14.78 GFLOPS
53. Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• The number of time steps: 2000
• The number of nodes: 8
• Target Problem Size (almost the maximum size with 8 GB/node)
  – NX * NY * NZ = 1536 x 768 x 240 / 8 nodes
  – NX * NY * NZ = 768 x 384 x 120 per node (not per MPI process)
• Target MPI processes and threads on the FX10
  – 8 nodes of the FX10 (16 cores per node)
  – PXTY: X MPI processes and Y threads per process
  – P8T16: minimum hybrid MPI-OpenMP execution for ppOpen-APPL/FDM, since it requires at least 8 MPI processes.
  – P16T8
  – P32T4
  – P64T2
  – P128T1: pure MPI
• The number of iterations for the kernels: 100
54. [Figure: Whole execution time for 2000 time steps, in seconds, on the FX10 (Oakleaf-FX, 8 nodes). Bars for P8T16, P16T8, P32T4, P64T2, and P128T1 show the breakdown into update_stress, update_stress_sponge, update_vel, update_vel_sponge, passing_velocity, passing_stress, IO, and Others (computation vs. communication time), comparing the Original Code, AT using the scalar-CPU code, and Full AT.]