September 29 to October 4, 2013, Dagstuhl Seminar 13401, Automatic Application Tuning for HPC Architectures, Session: Infrastructures, 10:30-11:00, October 1 (Tue), 2013.
1. ppOpen-AT: Yet Another Directive-based AT Language
Takahiro Katagiri,
Supercomputing Research Division,
Information Technology Center,
The University of Tokyo
Collaborators:
Satoshi Ohshima, Masaharu Matsumoto
(Information Technology Center, The University of Tokyo)
3. Performance Portability (PP)
Maintaining high performance across multiple computing environments.
◦ Not only multiple CPUs, but also multiple compilers.
◦ Run-time information, such as loop lengths and the number of threads, is important.
Auto-tuning (AT) is one candidate technology for establishing PP across multiple computing environments.
4. Questions
Are open AT infrastructures, including numerical libraries with AT, available for supercomputers in operation?
We should consider:
◦ Is a run-time code generator for AT available on login nodes with low overhead, and on dedicated batch-job systems?
We need to account for different vendors, such as Fujitsu, NEC, Hitachi, Cray, etc.
◦ Are the required software stacks available on these systems?
Scripting languages, such as Python, Perl, etc. On some Japanese supercomputers, only very limited scripting languages are supported.
Dedicated compilers, such as CAPS, etc.
5. Questions (Cont’d)
We should consider:
◦ Do AT systems require special daemons or OS kernel modifications?
Additional daemons are not permitted, to prevent high loads on supercomputer login nodes.
OS kernel modifications are not permitted, in order to keep vendor support contracts.
It is therefore desirable that all AT execution is performed at the user level.
7. ppOpen-HPC (1/3)
• Open-source infrastructure for the development and execution of large-scale scientific applications on post-peta-scale supercomputers with automatic tuning (AT)
• “pp” : post-peta-scale
• Five-year project (FY.2011-2015) (since April 2011)
• P.I.: Kengo Nakajima (ITC, The University of Tokyo)
• Part of “Development of System Software Technologies for
Post-Peta Scale High Performance Computing” funded by
JST/CREST (Japan Science and Technology Agency, Core
Research for Evolutional Science and Technology)
• 4.5 M$ for 5 yr.
• Team of 6 institutes, >30 people (5 PDs) from various fields: Co-Design
• ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo
• Kyoto U., JAMSTEC
8. ppOpen-HPC (2/3)
• Source code developed on a PC with a single processor is linked with these libraries, and the generated parallel code is optimized for post-peta-scale systems.
• Users don't have to worry about optimization, tuning, parallelization, etc.
• CUDA, OpenGL, etc. are hidden.
• Parts of the MPI code are also hidden.
• OpenMP and OpenACC could be hidden.
– ppOpen-HPC consists of various types of optimized libraries, which cover various types of procedures for scientific computations.
• FEM, FDM, FVM, BEM, DEM
13. Target Application
Seism_3D:
Simulation for seismic wave analysis, developed by Professor Furumura at the University of Tokyo.
◦ The code has been restructured as ppOpen-APPL/FDM.
Finite Difference Method (FDM)
3D simulation
◦ 3D arrays are allocated.
Data type: single precision (real*4)
14. An Example of Seism_3D Simulation
The 2000 earthquake in the western part of Tottori prefecture, Japan ([1], p. 14).
The region of 820 km x 410 km x 128 km is discretized with 0.4 km spacing:
NX x NY x NZ = 2050 x 1025 x 320 (≈ 6.4 : 3.2 : 1).
[1] T. Furumura, "Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking", Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009. In Japanese.
Figure: Seismic wave propagation in the 2000 western Tottori earthquake. (a) Measured waves; (b) simulation results. (Reference: [1], p. 13)
15. The Heaviest Loop (10%-20% of Total Time)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL  = LAM(I,J,K)
      RM  = RIG(I,J,K)
      RM2 = RM + RM
      RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
      QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
      SXX(I,J,K) = ( SXX(I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
      SYY(I,J,K) = ( SYY(I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
      SZZ(I,J,K) = ( SZZ(I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
      RMAXY = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I+1,J,K) + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
      RMAXZ = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I+1,J,K) + 1.0/RIG(I,J,K+1) + 1.0/RIG(I+1,J,K+1))
      RMAYZ = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I,J+1,K) + 1.0/RIG(I,J,K+1) + 1.0/RIG(I,J+1,K+1))
      SXY(I,J,K) = ( SXY(I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
      SXZ(I,J,K) = ( SXZ(I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
      SYZ(I,J,K) = ( SYZ(I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
    END DO
  END DO
END DO
Flow dependencies: QG is computed in the upper half of the loop body and also used by the lower half, so a naive loop split is not valid without re-calculation.
16. New ppOpen-AT Directives
- Loop Split & Fusion with data-flow dependence
!oat$ install LoopFusionSplit region start
!$omp parallel do private(k,j,i,STMP1,STMP2,STMP3,STMP4,RL,RM,RM2,RMAXY,RMAXZ,RMAYZ,RLTHETA,QG)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL = LAM(I,J,K); RM = RIG(I,J,K); RM2 = RM + RM
      RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
!oat$ SplitPointCopyDef region start
      QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
!oat$ SplitPointCopyDef region end
      SXX(I,J,K) = ( SXX(I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
      SYY(I,J,K) = ( SYY(I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
      SZZ(I,J,K) = ( SZZ(I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
!oat$ SplitPoint (K, J, I)
      STMP1 = 1.0/RIG(I,J,K); STMP2 = 1.0/RIG(I+1,J,K); STMP4 = 1.0/RIG(I,J,K+1)
      STMP3 = STMP1 + STMP2
      RMAXY = 4.0/(STMP3 + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
      RMAXZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I+1,J,K+1))
      RMAYZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I,J+1,K+1))
!oat$ SplitPointCopyInsert
      SXY(I,J,K) = ( SXY(I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
      SXZ(I,J,K) = ( SXZ(I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
      SYZ(I,J,K) = ( SYZ(I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
    END DO; END DO; END DO
!$omp end parallel do
!oat$ install LoopFusionSplit region end
Annotations on the directives above:
◦ SplitPointCopyDef region: defines the re-calculation (here, the statement computing QG).
◦ SplitPointCopyInsert: marks where the re-calculation is used, so the code generator re-inserts it after a split.
◦ SplitPoint (K, J, I): marks the loop split point.
17. Candidates of Auto-generated Codes
#1 [Baseline]: original 3-nested loop
#2 [Split]: loop splitting at the I-loop
#3 [Split]: loop splitting at the J-loop
#4 [Split]: loop splitting at the K-loop (two separate 3-nested loops)
#5 [Split&Fusion]: loop fusion applied to #2 (2-nested loop)
#6 [Fusion]: loop fusion applied to #1 (loop collapse)
#7 [Fusion]: loop fusion applied to #1 (2-nested loop)
18. Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
19. Performance Evaluation with ppOpen-APPL/FDM (Alpha Version)
Takahiro Katagiri, Satoshi Ito, Satoshi Ohshima, "Early Experiences for Adaptation of Auto-tuning by ppOpen-AT to an Explicit Method", Special Session: Auto-Tuning for Multicore and GPU (ATMG) (in conjunction with IEEE MCSoC-13), National Institute of Informatics, Tokyo, Japan, September 26-28, 2013.
25. Current Answers to the Questions on AT Systems
A minimal software-stack requirement is important for using AT facilities on supercomputers in operation.
Since there is no standardization of AT functions yet, efforts toward AT with fully user-level execution are required.
26. Future Direction
The standardization of AT functions for supercomputers is an important future direction, covering, for example:
◦ Performance monitors.
◦ Code generators, especially dynamic code generators.
◦ Job schedulers, such as batch-job systems.
◦ Compiler optimizations, including directives and compiler options.
◦ Definitions of AT targets, such as execution speed, memory footprint, or power consumption.
◦ etc.
Establishing a standardization strategy for AT functions together with vendors is important.
◦ The Message Passing Interface (MPI) standardization by the MPI Forum is one success story for such standardization.
◦ Why not create a standard and a forum for AT?