ppOpen-AT: Yet Another Directive-base AT Language
Takahiro Katagiri,
Supercomputing Research Division,
Information Technology Center,
The University of Tokyo
September 29 to October 4, 2013, Dagstuhl Seminar 13401
Automatic Application Tuning for HPC Architectures
Session: Infrastructures, 10:30-11:00, October 1 (Tue), 2013.
Collaborators:
Satoshi Ohshima, Masaharu Matsumoto
(Information Technology Center, The University of Tokyo)
QUESTIONS FOR AT ON SUPERCOMPUTERS IN OPERATION
Performance Portability (PP)
• Keeping high performance across multiple computer environments.
◦ Not only multiple CPUs, but also multiple compilers.
◦ Run-time information, such as loop length and number of threads, is important.
• Auto-tuning (AT) is one of the candidate technologies for establishing PP across multiple computer environments.
Questions
• Are open AT infrastructures, including numerical libraries with AT, available for supercomputers in operation?
• We should consider:
◦ Is a run-time code generator for AT available on login nodes with low overhead, and also on dedicated batch-job systems?
– We need to take care of different vendors, such as Fujitsu, NEC, Hitachi, Cray, etc.
◦ Are the required software stacks available on the systems?
– Scripting languages, such as Python, Perl, etc.
– On some Japanese supercomputers, only very limited scripting languages are supported.
– Dedicated compilers, such as CAPS, etc.
Questions (Cont’d)
• We should consider:
◦ Do AT systems require special daemons or OS kernel modifications?
– Additional daemons are not permitted, to prevent high load on the login nodes of a supercomputer.
– OS kernel modification is not permitted, to keep the support contract with vendors.
• It is more desirable that all executions for AT are performed at user level.
RELATED PROJECT
ppOpen-HPC (1/3)
• Open-source infrastructure for the development and execution of large-scale scientific applications on post-peta-scale supercomputers with automatic tuning (AT)
• “pp”: post-peta-scale
• Five-year project (FY2011-2015), since April 2011
• P.I.: Kengo Nakajima (ITC, The University of Tokyo)
• Part of “Development of System Software Technologies for Post-Peta Scale High Performance Computing” funded by JST/CREST (Japan Science and Technology Agency, Core Research for Evolutional Science and Technology)
• 4.5 M$ for 5 years
• Team with 6 institutes, >30 people (5 PDs) from various fields: Co-Design
• ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo
• Kyoto U., JAMSTEC
ppOpen-HPC (2/3)
• Source code developed on a PC with a single processor is linked with these libraries, and the generated parallel code is optimized for post-peta-scale systems.
• Users don’t have to worry about optimization, tuning, parallelization, etc.
• CUDA, OpenGL, etc. are hidden.
• Parts of the MPI code are also hidden.
• OpenMP and OpenACC could be hidden.
– ppOpen-HPC consists of various types of optimized libraries, which cover various types of procedures for scientific computations.
• FEM, FDM, FVM, BEM, DEM
OPL@SC12
ppOpen-HPC covers …
PPOPEN-AT BASICS
ppOpen-AT System
[Figure: ppOpen-AT system workflow. ① Before release-time, the library developer describes user knowledge as ppOpen-AT directives in ppOpen-APPL/*. ② The ppOpen-AT auto-tuner performs automatic code generation, producing ppOpen-APPL/* with candidates 1, 2, 3, ..., n. ③ Library call by the library user. ④ Execution time on the target computers. ⑤ Selection. ⑥ Auto-tuned kernel execution at run-time.]
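The measure-and-select steps in the workflow above can be sketched as follows. This is only an illustration under stated assumptions, not the code that ppOpen-AT generates: the program name, the `run_candidate` routine, its three toy candidates, and the problem size are all hypothetical.

```fortran
program select_fastest_candidate
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 1000000   ! hypothetical problem size
  integer, parameter :: ncand = 3     ! hypothetical number of candidates
  real, allocatable :: x(:), y(:)
  integer(i8) :: t0, t1, rate
  real :: elapsed, best_time
  integer :: cand, best

  allocate(x(n), y(n))
  x = 1.0
  best = 1
  best_time = huge(1.0)

  ! Time every candidate once on the target machine.
  do cand = 1, ncand
    y = 0.0
    call system_clock(t0, rate)
    call run_candidate(cand, x, y, n)
    call system_clock(t1)
    elapsed = real(t1 - t0) / real(rate)
    print '(a,i0,a,es10.3,a)', 'candidate #', cand, ': ', elapsed, ' s'
    if (elapsed < best_time) then
      best_time = elapsed
      best = cand
    end if
  end do

  ! Later calls use only the selected candidate.
  print '(a,i0)', 'selected candidate #', best
  call run_candidate(best, x, y, n)

contains

  ! Three toy implementations of the same update b = b + 2*a.
  subroutine run_candidate(id, a, b, m)
    integer, intent(in) :: id, m
    real, intent(in) :: a(m)
    real, intent(inout) :: b(m)
    integer :: i
    select case (id)
    case (1)            ! plain loop
      do i = 1, m
        b(i) = b(i) + 2.0 * a(i)
      end do
    case (2)            ! loop unrolled by two, with a remainder step
      do i = 1, m - 1, 2
        b(i)     = b(i)     + 2.0 * a(i)
        b(i + 1) = b(i + 1) + 2.0 * a(i + 1)
      end do
      if (mod(m, 2) == 1) b(m) = b(m) + 2.0 * a(m)
    case default        ! whole-array expression
      b = b + 2.0 * a
    end select
  end subroutine run_candidate

end program select_fastest_candidate
```

Everything here runs as an ordinary user program with a standard timer, which matches the user-level requirement raised in the Questions slides.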
EARLY EXPERIENCE IN EXPLICIT METHOD (FINITE DIFFERENCE METHOD)
Target Application
• Seism_3D: Simulation for seismic wave analysis.
• Developed by Professor Furumura at the University of Tokyo.
◦ The code is reconstructed as ppOpen-APPL/FDM.
• Finite Difference Method (FDM)
• 3D simulation
◦ 3D arrays are allocated.
• Data type: single precision (real*4)
An Example of Seism_3D Simulation
• The 2000 earthquake in the western part of Tottori Prefecture, Japan ([1], p. 14).
• The region of 820 km x 410 km x 128 km is discretized with a grid spacing of 0.4 km.
• NX x NY x NZ = 2050 x 1025 x 320 ≒ 6.4 : 3.2 : 1.
[1] T. Furumura, “Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking”, Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009. In Japanese.
Figure: Seismic wave propagation in the earthquake in western Tottori Prefecture, Japan. (a) Measured waves; (b) simulation results (reference: [1], p. 13).
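The grid dimensions quoted above follow from dividing each extent by the 0.4 km spacing. A minimal sketch of that arithmetic, which also estimates the storage of one single-precision 3D array, is given below. The program and variable names are illustrative and not taken from Seism_3D.

```fortran
program grid_size
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  real, parameter :: h = 0.4                            ! grid spacing in km (from the slide)
  real, parameter :: len_km(3) = [820.0, 410.0, 128.0]  ! region extents in km (from the slide)
  integer :: npts(3)
  integer(i8) :: nelem, nbytes

  ! Number of grid points per direction: extent / spacing.
  npts = nint(len_km / h)

  ! Elements and bytes of one real*4 3D array; 64-bit integers avoid overflow.
  nelem = product(int(npts, i8))
  nbytes = 4_i8 * nelem

  print '(a,3(1x,i0))', 'NX NY NZ =', npts
  print '(a,2(1x,f6.2))', 'NX/NZ, NY/NZ =', real(npts(1)) / npts(3), real(npts(2)) / npts(3)
  print '(a,i0)', 'elements per 3D array: ', nelem
  print '(a,f6.2)', 'GB per real*4 3D array: ', real(nbytes) / 1.0e9
end program grid_size
```

With these inputs each 3D array holds 672,400,000 elements, roughly 2.7 GB in single precision. A kernel that touches many such arrays is therefore sensitive to how its loops traverse memory.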
The Heaviest Loop (10% to 20% of Total Time)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      ! Material parameters at the grid point
      RL = LAM(I,J,K)
      RM = RIG(I,J,K)
      RM2 = RM + RM
      RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
      ! Absorbing and attenuation factor, reused by all six stress updates
      QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
      ! Normal stress components
      SXX(I,J,K) = ( SXX(I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
      SYY(I,J,K) = ( SYY(I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
      SZZ(I,J,K) = ( SZZ(I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
      ! Harmonic means of the rigidity over four neighboring points
      RMAXY = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I+1,J,K) + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
      RMAXZ = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I+1,J,K) + 1.0/RIG(I,J,K+1) + 1.0/RIG(I+1,J,K+1))
      RMAYZ = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I,J+1,K) + 1.0/RIG(I,J,K+1) + 1.0/RIG(I,J+1,K+1))
      ! Shear stress components
      SXY(I,J,K) = ( SXY(I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
      SXZ(I,J,K) = ( SXZ(I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
      SYZ(I,J,K) = ( SYZ(I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
    END DO
  END DO
END DO
Flow dependencies: QG and the RMAXY, RMAXZ, RMAYZ temporaries are defined inside the loop body and used by the later stress updates.
New ppOpen-AT Directives
- Loop Split & Fusion with data-flow dependence
!oat$ install LoopFusionSplit region start
!$omp parallel do private(k,j,i,STMP1,STMP2,STMP3,STMP4,RL,RM,RM2,RMAXY,RMAXZ,RMAYZ,RLTHETA,QG)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL = LAM(I,J,K); RM = RIG(I,J,K); RM2 = RM + RM
      RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
!oat$ SplitPointCopyDef region start
      QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
!oat$ SplitPointCopyDef region end
      SXX(I,J,K) = ( SXX(I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
      SYY(I,J,K) = ( SYY(I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
      SZZ(I,J,K) = ( SZZ(I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
!oat$ SplitPoint (K, J, I)
      STMP1 = 1.0/RIG(I,J,K); STMP2 = 1.0/RIG(I+1,J,K); STMP4 = 1.0/RIG(I,J,K+1)
      STMP3 = STMP1 + STMP2
      RMAXY = 4.0/(STMP3 + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
      RMAXZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I+1,J,K+1))
      RMAYZ = 4.0/(STMP1 + 1.0/RIG(I,J+1,K) + STMP4 + 1.0/RIG(I,J+1,K+1))
!oat$ SplitPointCopyInsert
      SXY(I,J,K) = ( SXY(I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
      SXZ(I,J,K) = ( SXZ(I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
      SYZ(I,J,K) = ( SYZ(I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
END DO; END DO; END DO
!$omp end parallel do
!oat$ install LoopFusionSplit region end
Callouts:
- SplitPointCopyDef region: the re-calculation is defined here.
- SplitPoint (K, J, I): the loop split point.
- SplitPointCopyInsert: the re-calculation defined above is used here.
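The effect of splitting at a point while re-calculating a value that flows across it can be illustrated with a reduced kernel. The sketch below is an assumption-laden toy, not output of ppOpen-AT: the arrays `q`, `d`, `s1`, `s2`, `r1`, `r2`, the constants, and the sizes are hypothetical. It shows only that the split form gives the same result once `qg` is recomputed in the second loop nest.

```fortran
program loop_split_sketch
  implicit none
  integer, parameter :: nx = 8, ny = 6, nz = 4   ! small hypothetical sizes
  real, parameter :: dt = 0.01
  real :: q(nx,ny,nz), d(nx,ny,nz)
  real :: s1(nx,ny,nz), s2(nx,ny,nz)   ! updated by the single (baseline) loop nest
  real :: r1(nx,ny,nz), r2(nx,ny,nz)   ! updated by the split loop nests
  real :: qg
  integer :: i, j, k

  call random_number(q)
  call random_number(d)
  s1 = 1.0; s2 = 2.0
  r1 = s1;  r2 = s2

  ! Baseline: one loop nest; qg flows from its definition to both updates.
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        qg = 0.5 * q(i,j,k)
        s1(i,j,k) = (s1(i,j,k) + d(i,j,k) * dt) * qg
        s2(i,j,k) = (s2(i,j,k) + 2.0 * d(i,j,k) * dt) * qg
      end do
    end do
  end do

  ! Split, first loop nest: statements before the split point.
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        qg = 0.5 * q(i,j,k)
        r1(i,j,k) = (r1(i,j,k) + d(i,j,k) * dt) * qg
      end do
    end do
  end do

  ! Split, second loop nest: qg is re-calculated because its value
  ! from the first loop nest is no longer available per iteration.
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        qg = 0.5 * q(i,j,k)
        r2(i,j,k) = (r2(i,j,k) + 2.0 * d(i,j,k) * dt) * qg
      end do
    end do
  end do

  print *, 'max difference:', max(maxval(abs(s1 - r1)), maxval(abs(s2 - r2)))
end program loop_split_sketch
```

Each element sees the same operations in both forms, so the reported difference should be zero. The trade-off is one extra evaluation of `qg` per iteration in exchange for two shorter loop bodies.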
Candidates of Auto-generated Codes
• #1 [Baseline]: Original 3-nested loop
• #2 [Split]: Loop splitting with the I-loop
• #3 [Split]: Loop splitting with the J-loop
• #4 [Split]: Loop splitting with the K-loop (separated, two 3-nested loops)
• #5 [Split&Fusion]: Loop fusion with #2 (2-nested loop)
• #6 [Fusion]: Loop fusion with #1 (loop collapse)
• #7 [Fusion]: Loop fusion with #1 (2-nested loop)
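The two fusion forms named in the list, a full loop collapse and a 2-nested loop, can be sketched with plain index arithmetic. The program below is an illustration under assumptions: the array names, sizes, and the test value `i + 10*j + 100*k` are hypothetical, and the code ppOpen-AT actually generates may differ.

```fortran
program loop_fusion_sketch
  implicit none
  integer, parameter :: nx = 5, ny = 4, nz = 3   ! small hypothetical sizes
  integer :: a(nx,ny,nz), b(nx,ny,nz), c(nx,ny,nz)
  integer :: i, j, k, kj, kji

  ! Original 3-nested loop.
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        a(i,j,k) = i + 10*j + 100*k
      end do
    end do
  end do

  ! Loop collapse: one loop of length nz*ny*nx; i, j, k are recovered
  ! from the fused index, with i varying fastest as in the nested form.
  do kji = 1, nz*ny*nx
    k = (kji - 1) / (ny*nx) + 1
    j = mod((kji - 1) / nx, ny) + 1
    i = mod(kji - 1, nx) + 1
    b(i,j,k) = i + 10*j + 100*k
  end do

  ! 2-nested loop: K and J fused into one outer loop, I kept as inner loop.
  do kj = 1, nz*ny
    k = (kj - 1) / ny + 1
    j = mod(kj - 1, ny) + 1
    do i = 1, nx
      c(i,j,k) = i + 10*j + 100*k
    end do
  end do

  print '(a,l1)', 'collapsed loop matches: ', all(a == b)
  print '(a,l1)', '2-nested loop matches:  ', all(a == c)
end program loop_fusion_sketch
```

A fused outer loop has more iterations to distribute across threads than the original K-loop alone, which is one reason such candidates can pay off at high thread counts.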
Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
PERFORMANCE EVALUATION WITH PPOPEN-APPL/FDM IN ALPHA VERSION
Takahiro Katagiri, Satoshi Ito, Satoshi Ohshima, “Early Experiences for Adaptation of Auto-tuning by ppOpen-AT to an Explicit Method”, Special Session: Auto-Tuning for Multicore and GPU (ATMG) (in conjunction with the IEEE MCSoC-13), National Institute of Informatics, Tokyo, Japan, September 26-28, 2013.
Test Environments
1. FX10 (Fujitsu PRIMEHPC FX10)
◦ SPARC64 IXfx (1.848 GHz), 16 cores, maximum 16 threads.
◦ Fujitsu Fortran Compiler, Version 1.2.1.
◦ Options: -Kfast, -openmp.
2. T2K (AMD Quad-core Opteron (Barcelona))
◦ AMD Opteron 8356 (2.3 GHz), 16 cores (4 sockets), maximum 16 threads.
◦ Intel Fortran Compiler, Version 11.0.
◦ Options: -fast -openmp -mcmodel=medium.
3. Sandy Bridge (Intel Sandy Bridge)
◦ Xeon E5 (Sandy Bridge E5-2687W) (8 physical cores, 16 threads) (3.1 GHz) (Turbo Boost off), 32 cores (2 sockets), maximum 32 threads.
◦ Intel Fortran Compiler, Version 12.1.
◦ Options: -fast -openmp -mcmodel=medium.
4. SR16K (HITACHI SR16000/M1)
◦ IBM Power7 (3.83 GHz), 32 cores (4 sockets), maximum 64 threads (SMT).
◦ HITACHI Optimization Fortran, Version 03-01-/B.
◦ Options: -opt=ss -omp.
AT Effect: Very Small and Small
[Figure: four bar charts of execution time in seconds versus number of threads for candidates #1 to #7.
(A) FX10 (very small, #repeat = 100,000); threads: 1, 4, 8, 16.
(B) T2K (very small, #repeat = 100,000); threads: 1, 4, 8, 16.
(C) Sandy Bridge (small, #repeat = 1,000); threads: 1, 8, 16, 32.
(D) SR16K (small, #repeat = 1,000); threads: 1, 8, 32, 64.
Panel annotations: “#2, #5 are the best.” “#4, #5, #7 are the best.” “#2, #3, #4, #5 are the best.” “#2, #4, #5 are the best.”]
#5 and #7 were the best when the number of threads was increased.
AT Effect: Large Size
[Figure: four bar charts of execution time in seconds versus number of threads for candidates #1 to #7.
(A) FX10 (#repeat = 10); threads: 1, 4, 8, 16.
(B) T2K (#repeat = 10); threads: 1, 4, 8, 16.
(C) Sandy Bridge (#repeat = 10); threads: 1, 8, 16, 32.
(D) SR16K (#repeat = 10); threads: 1, 8, 32, 64.
Panel annotations: “#2, #3, #5 are the best.” “#2, #7 are the best.” “#5 is the best.” “#4 is the best.”]
One fixed implementation was the best.
AT Effect for Hybrid OpenMP-MPI
[Figure: two charts of speedup relative to pure MPI execution for each type of hybrid MPI-OpenMP execution on the FX10, kernel: update_stress. One chart shows the original code without AT, the other the code with AT (speedups relative to the case without AT). Chart annotations: “No merit for hybrid MPI-OpenMP executions.” “Effect on pure MPI execution.” “Gain by using hybrid MPI-OpenMP executions.”]
By adopting loop transformations from the AT, we obtained:
• Maximum 1.5x speedup over pure MPI (without thread execution).
• Maximum 2.5x speedup over pure MPI in hybrid MPI-OpenMP execution.
PXTY: X processes, Y threads per process.
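As a reading aid for the PXTY notation, the sketch below lists the process and thread combinations that fully occupy one node. The 16-core node size is an assumption made for illustration (it matches the FX10 core count in the test environments). The configurations actually measured in the talk are not recoverable from this transcript.

```fortran
program hybrid_configs
  implicit none
  integer, parameter :: ncores = 16   ! assumed cores per node (hypothetical)
  integer :: p

  ! Every divisor p of ncores gives p processes with ncores/p threads each.
  do p = 1, ncores
    if (mod(ncores, p) == 0) then
      print '(a,i0,a,i0)', 'P', p, 'T', ncores / p
    end if
  end do
end program hybrid_configs
```

Under this assumption the program lists P1T16, P2T8, P4T4, P8T2, and P16T1. The last one corresponds to pure MPI on that node, with one thread per process.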
ANSWER AND PLANS FOR THE FUTURE
Current Answers to AT systems
A minimal software-stack requirement is important for using AT facilities on supercomputers in operation.
Since there is no standardization of AT functions, efforts toward AT with full user-level execution are required.
Future Direction
• The standardization of AT functions for supercomputers is an important future direction, for example:
◦ Performance monitors.
◦ Code generators, especially dynamic code generators.
◦ Job schedulers, such as batch-job systems.
◦ Compiler optimizations, including directives and compiler options.
◦ Defining AT targets, such as execution speed, memory usage, or power consumption.
◦ etc.
• Making a standardization strategy for AT functions together with vendors is important.
◦ The standardization of the Message Passing Interface (MPI) in the MPI Forum is one successful example of standardization.
◦ Why not create a standard and a forum for AT?

 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

ppOpen-AT : Yet Another Directive-base AT Language

  • 1. ppOpen-AT : Yet Another Directive-base AT Language. Takahiro Katagiri, Supercomputing Research Division, Information Technology Center, The University of Tokyo. Dagstuhl Seminar 13401, Automatic Application Tuning for HPC Architectures, September 29 to October 4, 2013. Session: Infrastructures, 10:30-11:00, Tuesday, October 1, 2013. Collaborators: Satoshi Ohshima, Masaharu Matsumoto (Information Technology Center, The University of Tokyo).
  • 2. QUESTIONS FOR AT ON SUPERCOMPUTER IN OPERATION
  • 3. Performance Portability (PP)
      ◦ Keeping high performance across multiple computer environments.
          - Not only multiple CPUs, but also multiple compilers.
          - Run-time information, such as loop length and number of threads, is important.
      ◦ Auto-tuning (AT) is one of the candidate technologies for establishing PP across multiple computer environments.
  • 4. Questions
      ◦ Are open AT infrastructures, including numerical libraries with AT, available for supercomputers in operation?
      ◦ We should consider:
          - Is a run-time code generator for AT available on login nodes with low overhead, and on dedicated batch-job systems?
              - Differences between vendors, such as Fujitsu, NEC, Hitachi, and Cray, need to be taken into account.
          - Are the required software stacks available on the systems?
              - Scripting languages, such as Python and Perl.
              - Some Japanese supercomputers support only a very limited set of scripting languages.
              - Dedicated compilers, such as CAPS.
  • 5. Questions (Cont’d)
      ◦ We should consider:
          - Do AT systems require special daemons or OS kernel modifications?
              - Additional daemons are not permitted, to prevent high loads on the login nodes of a supercomputer.
              - OS kernel modification is not permitted, in order to keep the vendor's support contract.
              - It is more desirable that all AT execution is performed at user level.
  • 7. ppOpen-HPC (1/3)
      ◦ Open-source infrastructure for the development and execution of large-scale scientific applications on post-peta-scale supercomputers with automatic tuning (AT).
      ◦ “pp”: post-peta-scale.
      ◦ Five-year project (FY2011-2015), started in April 2011.
      ◦ P.I.: Kengo Nakajima (ITC, The University of Tokyo).
      ◦ Part of “Development of System Software Technologies for Post-Peta Scale High Performance Computing”, funded by JST/CREST (Japan Science and Technology Agency, Core Research for Evolutional Science and Technology).
      ◦ 4.5 M$ for 5 years.
      ◦ Team of 6 institutes and more than 30 people (5 PDs) from various fields: co-design.
          - ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo
          - Kyoto U., JAMSTEC
  • 8. ppOpen-HPC (2/3)
      ◦ Source code developed on a PC with a single processor is linked with these libraries, and the generated parallel code is optimized for post-peta-scale systems.
      ◦ Users don't have to worry about optimization, tuning, parallelization, etc.
          - CUDA, OpenGL, etc. are hidden.
          - Parts of the MPI code are also hidden.
          - OpenMP and OpenACC could be hidden.
      ◦ ppOpen-HPC consists of various types of optimized libraries, which cover various types of procedures for scientific computations.
          - FEM, FDM, FVM, BEM, DEM
      ◦ OPL@SC12
  • 12. EARLY EXPERIENCE IN EXPLICIT METHOD (FINITE DIFFERENCE METHOD)
  • 13. Target Application
      ◦ Seism_3D: simulation for seismic wave analysis.
      ◦ Developed by Professor Furumura at the University of Tokyo.
          - The code has been restructured as ppOpen-APPL/FDM.
      ◦ Finite Difference Method (FDM).
      ◦ 3D simulation.
          - 3D arrays are allocated.
      ◦ Data type: single precision (real*4).
  • 14. An Example of Seism_3D Simulation
      ◦ Earthquake in the western part of Tottori Prefecture, Japan, in 2000 ([1], p. 14).
      ◦ The region of 820 km x 410 km x 128 km is discretized with a grid spacing of 0.4 km.
      ◦ NX x NY x NZ = 2050 x 1025 x 320, a ratio of approximately 6.4 : 3.2 : 1.
      ◦ Figure: Seismic wave translations in the earthquake in the western part of Tottori Prefecture, Japan. (a) Measured waves; (b) simulation results. (Reference: [1], p. 13)
      ◦ [1] T. Furumura, “Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking”, Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009. In Japanese.
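The grid dimensions quoted on this slide follow directly from the region size and the 0.4 km spacing. The short Python sketch below reproduces that arithmetic and, as an added illustration, estimates the size of one single-precision 3D array on that grid. The helper name `grid_points` and the variable names are hypothetical; only the extents, the spacing, and the real*4 data type come from the slides.

```python
def grid_points(extent_km, spacing_km):
    """Number of grid points along one axis for a given extent and spacing."""
    return round(extent_km / spacing_km)


# Region size and grid spacing as stated on the slide.
extents_km = {"NX": 820.0, "NY": 410.0, "NZ": 128.0}
spacing_km = 0.4

dims = {name: grid_points(length, spacing_km) for name, length in extents_km.items()}
total_points = dims["NX"] * dims["NY"] * dims["NZ"]

# Aspect ratio relative to NZ, rounded to one decimal place.
ratio = tuple(round(dims[name] / dims["NZ"], 1) for name in ("NX", "NY", "NZ"))

# One real*4 (single-precision) 3D array needs 4 bytes per grid point.
bytes_per_array = 4 * total_points
gib_per_array = bytes_per_array / 2**30

print(dims, ratio, f"{gib_per_array:.2f} GiB per 3D array")
```

Each 3D array on the full grid is roughly 2.5 GiB in single precision, which gives a sense of why the stress-update loop over many such arrays dominates run time.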
  • 15. The Heaviest Loop (10% to 20% of Total Time)
        DO K = 1, NZ
        DO J = 1, NY
        DO I = 1, NX
          RL  = LAM (I,J,K)
          RM  = RIG (I,J,K)
          RM2 = RM + RM
          RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
          QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
          SXX (I,J,K) = ( SXX (I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
          SYY (I,J,K) = ( SYY (I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
          SZZ (I,J,K) = ( SZZ (I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
          RMAXY = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I+1,J,K) &
                     + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
          RMAXZ = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I+1,J,K) &
                     + 1.0/RIG(I,J,K+1) + 1.0/RIG(I+1,J,K+1))
          RMAYZ = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I,J+1,K) &
                     + 1.0/RIG(I,J,K+1) + 1.0/RIG(I,J+1,K+1))
          SXY (I,J,K) = ( SXY (I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
          SXZ (I,J,K) = ( SXZ (I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
          SYZ (I,J,K) = ( SYZ (I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
        END DO
        END DO
        END DO
      ◦ Slide annotation: Flow Dependencies
  • 16. New ppOpen-AT Directives - Loop Split & Fusion with data-flow dependence
        !oat$ install LoopFusionSplit region start
        !$omp parallel do private(k,j,i,STMP1,STMP2,STMP3,STMP4,RL,RM,RM2,RMAXY,RMAXZ,RMAYZ,RLTHETA,QG)
        DO K = 1, NZ
        DO J = 1, NY
        DO I = 1, NX
          RL = LAM (I,J,K); RM = RIG (I,J,K); RM2 = RM + RM
          RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
        !oat$ SplitPointCopyDef region start
          QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
        !oat$ SplitPointCopyDef region end
          SXX (I,J,K) = ( SXX (I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
          SYY (I,J,K) = ( SYY (I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
          SZZ (I,J,K) = ( SZZ (I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
        !oat$ SplitPoint (K, J, I)
          STMP1 = 1.0/RIG(I,J,K); STMP2 = 1.0/RIG(I+1,J,K); STMP4 = 1.0/RIG(I,J,K+1)
          STMP3 = STMP1 + STMP2
          RMAXY = 4.0/(STMP3 + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
          RMAXZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I+1,J,K+1))
          RMAYZ = 4.0/(STMP1 + 1.0/RIG(I,J+1,K) + STMP4 + 1.0/RIG(I,J+1,K+1))
        !oat$ SplitPointCopyInsert
          SXY (I,J,K) = ( SXY (I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
          SXZ (I,J,K) = ( SXZ (I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
          SYZ (I,J,K) = ( SYZ (I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
        END DO; END DO; END DO
        !$omp end parallel do
        !oat$ install LoopFusionSplit region end
      ◦ Slide annotations:
          - SplitPointCopyDef region: the re-calculation is defined here.
          - SplitPointCopyInsert: the re-calculation is used here.
          - SplitPoint (K, J, I): loop split point.
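To make the role of the copy directives concrete, here is a minimal Python sketch of loop splitting with re-calculation on a simplified, hypothetical 1D kernel. It is not the code ppOpen-AT generates: the function names, the reduced set of arrays, and the sample data are all assumptions made for illustration. The point it shows is that the value `qg`, defined before the split point and used after it, is recomputed in the second loop so that the two loops no longer share a scalar.

```python
def fused(absx, q, dxvx, dxvy, sxx, sxy, dt):
    """Baseline: one loop; qg is computed once and used by both updates."""
    sxx, sxy = list(sxx), list(sxy)
    for i in range(len(q)):
        qg = absx[i] * q[i]
        sxx[i] = (sxx[i] + dxvx[i] * dt) * qg
        sxy[i] = (sxy[i] + dxvy[i] * dt) * qg
    return sxx, sxy


def split(absx, q, dxvx, dxvy, sxx, sxy, dt):
    """Split version: two loops; qg is re-calculated after the split point."""
    sxx, sxy = list(sxx), list(sxy)
    for i in range(len(q)):
        qg = absx[i] * q[i]          # definition (cf. SplitPointCopyDef)
        sxx[i] = (sxx[i] + dxvx[i] * dt) * qg
    # ---- split point ----
    for i in range(len(q)):
        qg = absx[i] * q[i]          # re-calculated copy (cf. SplitPointCopyInsert)
        sxy[i] = (sxy[i] + dxvy[i] * dt) * qg
    return sxx, sxy


# Arbitrary sample data, used only to compare the two variants.
n = 8
absx = [1.0 - 0.01 * i for i in range(n)]
q = [0.99] * n
dxvx = [0.5 * i for i in range(n)]
dxvy = [0.25 * i for i in range(n)]
sxx0 = [1.0] * n
sxy0 = [2.0] * n

fused_result = fused(absx, q, dxvx, dxvy, sxx0, sxy0, dt=0.1)
split_result = split(absx, q, dxvx, dxvy, sxx0, sxy0, dt=0.1)
print(fused_result == split_result)
```

Both variants perform the same floating-point operations per element, so they produce identical results; the split version trades an extra multiplication for smaller loop bodies.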
  • 17. Candidates of Auto-generated Codes
      ◦ #1 [Baseline]: Original 3-nested loop
      ◦ #2 [Split]: Loop splitting with I-loop
      ◦ #3 [Split]: Loop splitting with J-loop
      ◦ #4 [Split]: Loop splitting with K-loop (separated, two 3-nested loops)
      ◦ #5 [Split&Fusion]: Loop fusion with #2 (2-nested loop)
      ◦ #6 [Fusion]: Loop fusion with #1 (loop collapse)
      ◦ #7 [Fusion]: Loop fusion with #1 (2-nested loop)
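The fusion candidates (#6 and #7) can be pictured as index transformations on the original K, J, I nest. The Python sketch below is an illustration under the assumption that the original indices are recovered from the fused index by integer division; the slides do not show how the generated code does this, and the function names are hypothetical. Indices are 1-based to mirror the Fortran loops.

```python
def nested(nz, ny, nx):
    """Candidate #1 style: original 3-nested loop, K outermost, I innermost."""
    return [(k, j, i)
            for k in range(1, nz + 1)
            for j in range(1, ny + 1)
            for i in range(1, nx + 1)]


def collapsed(nz, ny, nx):
    """Candidate #6 style: one loop of length nz*ny*nx (loop collapse)."""
    order = []
    for kji in range(nz * ny * nx):
        k, rest = divmod(kji, ny * nx)   # recover K from the fused index
        j, i = divmod(rest, nx)          # recover J and I
        order.append((k + 1, j + 1, i + 1))
    return order


def two_nested(nz, ny, nx):
    """Candidate #7 style: K and J fused into one loop, I-loop kept."""
    order = []
    for kj in range(nz * ny):
        k, j = divmod(kj, ny)
        for i in range(1, nx + 1):
            order.append((k + 1, j + 1, i))
    return order


reference = nested(3, 4, 5)
print(reference == collapsed(3, 4, 5) == two_nested(3, 4, 5))
```

All three visit the same index triples in the same order. A longer outer loop may help when the outermost loop length is short relative to the number of threads, which is consistent with the earlier point that loop length and thread count matter.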
  • 18. Overview
      1. Background and ppOpen-HPC Project
      2. ppOpen-AT Basics
      3. Adaptation to an FDM Application
      4. Performance Evaluation
      5. Conclusion
  • 19. PERFORMANCE EVALUATION WITH PPOPEN-APPL/FDM IN ALPHA VERSION
      ◦ Takahiro Katagiri, Satoshi Ito, Satoshi Ohshima, “Early Experiences for Adaptation of Auto-tuning by ppOpen-AT to an Explicit Method”, Special Session: Auto-Tuning for Multicore and GPU (ATMG) (in conjunction with the IEEE MCSoC-13), National Institute of Informatics, Tokyo, Japan, September 26-28, 2013.
  • 20. Test Environments
      1. FX10 (Fujitsu PRIMEHPC FX10)
          - SPARC64 IXfx (1.848 GHz), 16 cores, maximum 16 threads.
          - Fujitsu Fortran Compiler, Version 1.2.1.
          - Options: -Kfast, -openmp.
      2. T2K (AMD Quad-core Opteron (Barcelona))
          - AMD Opteron 8356 (2.3 GHz), 16 cores (4 sockets), maximum 16 threads.
          - Intel Fortran Compiler, Version 11.0.
          - Options: -fast -openmp -mcmodel=medium.
      3. Sandy Bridge (Intel Sandy Bridge)
          - Xeon E5 (Sandy Bridge E5-2687W) (8 physical cores, 16 threads) (3.1 GHz) (Turbo Boost off), 32 cores (2 sockets), maximum 32 threads.
          - Intel Fortran Compiler, Version 12.1.
          - Options: -fast -openmp -mcmodel=medium.
      4. SR16K (HITACHI SR16000/M1)
          - IBM Power7 (3.83 GHz), 32 cores (4 sockets), maximum 64 threads (SMT).
          - HITACHI Optimization Fortran, Version 03-01-/B.
          - Options: -opt=ss -omp.
  • 21. AT Effect: Very Small and Small
      ◦ [Figure: four plots of time in seconds versus number of threads for candidates #1 to #7.]
          - (A) FX10 (very small, #repeat = 100,000); threads: 1, 4, 8, 16.
          - (B) T2K (very small, #repeat = 100,000); threads: 1, 4, 8, 16.
          - (C) Sandy Bridge (small, #repeat = 1,000); threads: 1, 8, 16, 32.
          - (D) SR16K (small, #repeat = 1,000); threads: 1, 8, 32, 64.
      ◦ Annotations on the panels: “#2, #5 are the best.” “#4, #5, #7 are the best.” “#2, #3, #4, #5 are the best.” “#2, #4, #5 are the best.”
      ◦ #5 and #7 were the best when the number of threads was increased.
  • 22. AT Effect: Large Size
      ◦ [Figure: four plots of time in seconds versus number of threads for candidates #1 to #7.]
          - (A) FX10 (#repeat = 10); threads: 1, 4, 8, 16.
          - (B) T2K (#repeat = 10); threads: 1, 4, 8, 16.
          - (C) Sandy Bridge (#repeat = 10); threads: 1, 8, 16, 32.
          - (D) SR16K (#repeat = 10); threads: 1, 8, 32, 64.
      ◦ Annotations on the panels: “#2, #3, #5 are the best.” “#2, #7 are the best.” “#5 is the best.” “#4 is the best.”
      ◦ One fixed implementation was the best.
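Because the best candidate differs by machine, problem size, and thread count, the selection has to be made by measurement on the target system. The following Python sketch shows one simple way to do that: time every candidate and keep the fastest. It is an illustration only, not ppOpen-AT's mechanism; `measure`, `select_best`, and `tune` are hypothetical names, and the costs in the demo are made-up numbers used solely to exercise the selection logic, not measured results.

```python
import time


def measure(fn, repeat):
    """Wall-clock time in seconds for calling fn() `repeat` times."""
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return time.perf_counter() - start


def select_best(timings):
    """Return the candidate label with the smallest measured time."""
    return min(timings, key=timings.get)


def tune(candidates, repeat=10, timer=measure):
    """Time every candidate with `timer` and return (best label, all timings)."""
    timings = {label: timer(fn, repeat) for label, fn in candidates.items()}
    return select_best(timings), timings


# Demo with an injected timer so the outcome is deterministic.
# The costs below are hypothetical values, not results from the slides.
hypothetical_costs = {"#1": 1.2, "#4": 0.8, "#5": 0.9}
candidates = {label: (lambda c=cost: c) for label, cost in hypothetical_costs.items()}
best, timings = tune(candidates, timer=lambda fn, repeat: fn())
print(best, timings)
```

In real use the default `measure` timer would be applied to the actual candidate kernels, once per thread count of interest, and the winning label recorded for later runs.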
  • 23. AT Effect for Hybrid OpenMP-MPI
      ◦ The FX10, kernel: update_stress.
      ◦ [Figure: two plots of speedup relative to pure MPI execution versus type of hybrid MPI-OpenMP execution. Panels: “Original without AT” and “With AT (speedups to the case without AT)”. Annotations: “No merit for hybrid MPI-OpenMP executions.” “Effect on pure MPI execution.” “Gain by using MPI-OpenMP executions.”]
      ◦ PXTY: X processes, Y threads per process.
      ◦ By applying loop transformations from the AT, we obtained:
          - A maximum 1.5x speedup over pure MPI (without thread execution).
          - A maximum 2.5x speedup over pure MPI in hybrid MPI-OpenMP execution.
  • 25. Current Answers to AT systems
      ◦ A minimum software-stack requirement is important for using AT facilities on supercomputers in operation.
      ◦ Since there is no standardization of AT functions, efforts toward AT with fully user-level execution are required.
  • 26. Future Direction
      ◦ The standardization of AT functions for supercomputers is an important future direction, covering for example:
          - Performance monitors.
          - Code generators, especially dynamic code generators.
          - Job schedulers, such as batch-job systems.
          - Compiler optimizations, including directives and compiler options.
          - The definition of AT targets, such as execution speed, memory usage, or power consumption.
      ◦ Building a standardization strategy for AT functions together with vendors is important.
          - The standardization of the Message Passing Interface (MPI) in the MPI Forum is one successful example.
          - Why not create a standard and a forum for AT?