Towards Auto-tuning Facilities
into Supercomputers in Operation
- The FIBER approach and
minimizing software-stack requirements -
Takahiro Katagiri (片桐 孝洋)
Information Technology Center,
The University of Tokyo
(東京大学 情報基盤センター)
2014 ATAT in HPSC, National Taiwan University,
March 15, 2014 (Saturday), Performance 10:10-10:30
Joint work with: Satoshi Ohshima(大島 聡史)
Masaharu Matsumoto(松本 正晴)
Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
Background
• High-Thread Parallelism (HTP)
◦ Multi-core and many-core processors are pervasive.
▪ Multi-core CPUs: 8-16 cores, 16-64 threads with Hyper-Threading (HT) or Simultaneous Multithreading (SMT).
▪ Many-core CPUs: Xeon Phi, 60 cores, 240 threads with HT.
◦ Exploiting parallelism with all available threads is important.
• Performance Portability (PP)
◦ Keeping high performance across multiple computer environments.
▪ Not only multiple CPUs, but also multiple compilers.
▪ Run-time information, such as loop length and number of threads, is important.
◦ Auto-tuning (AT) is one of the candidate technologies for establishing PP across multiple computer environments.
ppOpen-HPC Project
• Middleware for HPC and its AT
◦ Supported by JST CREST, from FY2011 to FY2016.
◦ PI: Professor Kengo Nakajima (U. Tokyo)
• ppOpen-HPC
◦ An open-source infrastructure for reliable simulation codes on post-peta (pp) scale parallel computers.
◦ Consists of various types of libraries, which cover 5 kinds of discretization methods for scientific computations.
• ppOpen-AT
◦ An auto-tuning language for ppOpen-HPC codes.
◦ Uses knowledge from a previous project, the ABCLibScript Project.
◦ An auto-tuning language based on AT directives.
Software Architecture of ppOpen-HPC
[Figure: layered architecture. The User's Program sits on top of ppOpen-APPL (FEM, FDM, FVM, BEM, DEM), ppOpen-MATH (MG, GRAPH, VIS, MP), ppOpen-AT (STATIC, DYNAMIC), and ppOpen-SYS (COMM, FT). The Auto-Tuning Facility covers code generation for optimization candidates, search for the best candidate, and automatic execution of the optimization. The Resource Allocation Facility specifies the best execution allocations. Target hardware: many-core CPUs, GPUs, low-power CPUs, and vector CPUs.]
Overview of FIBER (Framework of Install-time, Before Execute-time and Run-time Auto-tuning) [T. Katagiri et al., 03]
[Figure: FIBER workflow. Legacy codes with AT directives (#pragma oat ...) are passed to the preprocessor of the AT directives. The preprocessor produces legacy codes with AT functions and the AT candidates specified by the AT directives (#implementation1, #implementation2, #implementation3). Compiling these yields executable codes with AT functions. Through the API on FIBER, the FIBER auto-tuner works with the parameter specified by the user, the best parameters, and the performance database. FIBER defines three AT timings: install-time, before execute-time, and run-time. The timings are specified by the AT directives.]
A Scenario to Software Developers for ppOpen-AT
[Figure: the software developer writes a description of AT by using ppOpen-AT, targeting optimizations that cannot be established by compilers, and invokes the dedicated preprocessor. The preprocessor generates a program with AT functions, which becomes executable code with optimization candidates and an AT function.]
#pragma oat install unroll (i,j,k) region start
#pragma oat varied (i,j,k) from 1 to 8
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
      A[i][j] = A[i][j] + B[i][k] * C[k][j];
    }
  }
}
#pragma oat install unroll (i,j,k) region end
■ Automatically Generated Functions
• Optimization candidates
• Performance monitor
• Parameter search
• Performance modeling
■ Description by Software Developer
• Optimizations for source codes, computer resources, and power consumption
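As a rough illustration of what one unrolled candidate for the directive-annotated loop nest above might look like, the sketch below compares the original nest with a variant whose i-loop is unrolled by 2. It is hand-written for illustration and is not the output of the ppOpen-AT preprocessor; the function names and the flat row-major array layout are assumptions made here.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: the loop nest from the slide, on n x n matrices stored
   row-major in flat arrays (a layout assumed for this sketch). */
void matmul_baseline(size_t n, double *A, const double *B, const double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                A[i * n + j] += B[i * n + k] * C[k * n + j];
}

/* Hypothetical candidate: the i-loop is unrolled by 2, with a
   remainder loop that handles the last row when n is odd. */
void matmul_unroll_i2(size_t n, double *A, const double *B, const double *C)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < n; k++) {
                A[i * n + j]       += B[i * n + k]       * C[k * n + j];
                A[(i + 1) * n + j] += B[(i + 1) * n + k] * C[k * n + j];
            }
        }
    }
    for (; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                A[i * n + j] += B[i * n + k] * C[k * n + j];
}
```

Both functions accumulate each element of A in the same k order, so they are expected to give the same result. An auto-tuner would time variants like these over the unrolling depths named in the directive (1 to 8) and keep the fastest one.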
Compiler Optimization and AT
1. Loop length is unknown at compile time.
• The optimal loop split and loop fusion can only be determined at run time.
• Run-time compilation is still at the research stage.
2. Loop split with data dependencies.
• Some loop splits require additional computation or memory space.
• Some compilers provide a directive for this, but the directive is not standardized.
• Code optimization is also not standardized between compilers.
3. Restrictions from the operation of supercomputers.
• Some supercomputer environments cannot supply the required “software-stack”, or the software stack cannot be used because of operational restrictions.
◦ Out of target for the system due to hardware restrictions.
◦ Ex.) CAPS on the K computer.
◦ Operation costs (budgets), vendor strategy, etc.
EARLY EXPERIENCE IN EXPLICIT METHOD (FINITE DIFFERENCE METHOD)
Target Application
• Seism3D: simulation software for seismic wave analysis.
• Strategic simulation software in Japan.
• Developed by Professor Furumura at the University of Tokyo.
◦ The code is reconstructed as ppOpen-APPL/FDM.
• Finite Difference Method (FDM)
• 3D simulation
◦ 3D arrays are allocated.
• Data type: single precision (real*4)
Source: http://www.eri.u-tokyo.ac.jp/furumura/tsunami/tsunami.html
The Heaviest Loop (20%+ of Total Time)
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1, &
!$omp&                    DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL1 = LAM (I,J,K)
      RM1 = RIG (I,J,K)
      RM2 = RM1 + RM1; RLRM2 = RL1+RM2
      DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
      DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
      SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
      SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
      SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
      DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
      DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
      DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
      SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
      SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
      SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
    END DO
  END DO
END DO
!$omp end parallel do
A flow dependency: RM1 is defined at the top of the loop body and used again in the SXY, SXZ, and SYZ updates.
Optimization Possibilities
• Loop splitting
◦ To reduce spill code.
◦ To maximize register usage.
• Loop fusion (loop collapse)
◦ The 3-nested loop can be fused in the following two ways.
◦ One-nested loop
▪ To increase outer-loop parallelism for thread parallelism.
◦ Two-nested loop
▪ To increase outer-loop parallelism for thread parallelism.
▪ To utilize pre-fetching for the inner loop.
Loop fusion: One dimensional (a loop collapse)
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1, &
!$omp&                    DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY * NX
  ! Recover the original indices from the collapsed index KK
  K = (KK-1)/(NY*NX) + 1
  J = mod((KK-1)/NX,NY) + 1
  I = mod(KK-1,NX) + 1
  RL1 = LAM (I,J,K)
  RM1 = RIG (I,J,K)
  RM2 = RM1 + RM1; RLRM2 = RL1+RM2
  DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
  DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
  SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
  SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
  SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
  DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
  DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
  DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
  SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
  SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
  SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
END DO
!$omp end parallel do
Merit: Loop length is huge.
This is good for OpenMP thread parallelism.
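The index arithmetic in the collapsed loop above (K, J, and I recovered from KK) can be checked in isolation. The following is a minimal sketch in C that mirrors the Fortran expressions with 1-based indices and compares them against the original K-J-I nest; the names `Index3`, `decollapse3`, and `collapse_matches_nest` are made up for this illustration.

```c
#include <assert.h>

/* 1-based (i, j, k) recovered from a 1-based collapsed index kk. */
typedef struct { int i, j, k; } Index3;

Index3 decollapse3(int kk, int nx, int ny)
{
    Index3 idx;
    assert(kk >= 1 && nx > 0 && ny > 0);
    idx.k = (kk - 1) / (ny * nx) + 1;     /* K = (KK-1)/(NY*NX) + 1    */
    idx.j = ((kk - 1) / nx) % ny + 1;     /* J = mod((KK-1)/NX,NY) + 1 */
    idx.i = (kk - 1) % nx + 1;            /* I = mod(KK-1,NX) + 1      */
    return idx;
}

/* Returns 1 if the collapsed loop visits exactly the same (i, j, k)
   sequence as the original K-J-I nest, and 0 otherwise. */
int collapse_matches_nest(int nx, int ny, int nz)
{
    int kk = 1;
    for (int k = 1; k <= nz; k++) {
        for (int j = 1; j <= ny; j++) {
            for (int i = 1; i <= nx; i++, kk++) {
                Index3 idx = decollapse3(kk, nx, ny);
                if (idx.i != i || idx.j != j || idx.k != k)
                    return 0;
            }
        }
    }
    return kk == nx * ny * nz + 1;
}
```

The integer divisions and remainders are extra work per iteration, which is one reason the collapsed form is treated as a candidate to be measured rather than as an unconditional improvement.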
Loop fusion: Two dimensional
Example:
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1, &
!$omp&                    DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY
  ! Recover K and J from the collapsed index KK; the I-loop is kept
  K = (KK-1)/NY + 1
  J = mod(KK-1,NY) + 1
  DO I = 1, NX
    RL1 = LAM (I,J,K)
    RM1 = RIG (I,J,K)
    RM2 = RM1 + RM1; RLRM2 = RL1+RM2
    DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
    DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
    SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
    SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
    SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
    DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
    DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
    DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
    SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
    SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
    SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
  END DO
END DO
!$omp end parallel do
Merit: Loop length is huge.
This is good for OpenMP thread parallelism.
The remaining I-loop gives an opportunity for pre-fetching.
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1, &
!$omp&                    DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL1 = LAM (I,J,K)
      RM1 = RIG (I,J,K)
      RM2 = RM1 + RM1; RLRM2 = RL1+RM2
      DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
      DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
      SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
      SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
      SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
    END DO
    DO I = 1, NX
      ! Re-computation (a copy) of RM1 for the second loop
      RM1 = RIG (I,J,K)
      DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
      DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
      DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
      SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
      SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
      SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
    END DO
  END DO
END DO
!$omp end parallel do
Re-computation (a copy) of RM1 is needed.
⇒ Compilers do not apply this split without a directive.
Perfect Splitting: Two 3-nested Loops
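The idea of splitting a loop at the cost of re-computing a shared temporary can be shown on a much smaller kernel. The sketch below is a simplified 1D analogue in C, assuming just two updates that share one temporary (`rm1`); the function and array names are hypothetical and the arithmetic is deliberately reduced, so it is not the Seism3D stress update itself.

```c
#include <assert.h>
#include <stddef.h>

/* Fused baseline: rm1 is loaded once and used by both updates,
   which is the flow dependency that blocks a naive split. */
void update_fused(size_t n, const float *rig, const float *d1, const float *d2,
                  float *sxx, float *sxy, float dt)
{
    for (size_t i = 0; i < n; i++) {
        float rm1 = rig[i];
        sxx[i] += (rm1 + rm1) * d1[i] * dt;
        sxy[i] += rm1 * d2[i] * dt;
    }
}

/* Split version: each loop carries fewer live values, but the
   second loop has to re-compute (re-load) rm1 on its own. */
void update_split(size_t n, const float *rig, const float *d1, const float *d2,
                  float *sxx, float *sxy, float dt)
{
    for (size_t i = 0; i < n; i++) {
        float rm1 = rig[i];
        sxx[i] += (rm1 + rm1) * d1[i] * dt;
    }
    for (size_t i = 0; i < n; i++) {
        float rm1 = rig[i];               /* the re-computed copy */
        sxy[i] += rm1 * d2[i] * dt;
    }
}
```

The split is legal here because `rig` is read-only in both loops, so re-loading it yields the same value. Whether the split is faster depends on register pressure versus the extra memory traffic, which is why it is left to measurement.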
New Directives for ppOpen-AT
• m_stress.f90 (ppohFDM_update_stress)
!OAT$ install LoopFusionSplit region start
!$omp parallel do private (k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1, &
!$omp&                     DXVZDZVX1,DYVZDZVY1)
do k = NZ00, NZ01
  do j = NY00, NY01
    do i = NX00, NX01
      RL1   = LAM (I,J,K)
!OAT$ SplitPointCopyDef sub region start
      RM1   = RIG (I,J,K)
!OAT$ SplitPointCopyDef sub region end
      RM2   = RM1 + RM1;  RLRM2 = RL1+RM2
      DXVX1 = DXVX(I,J,K);  DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
      D3V3  = DXVX1 + DYVY1 + DZVZ1
      SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
      SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
      SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
!OAT$ SplitPoint (K,J,I)
!OAT$ SplitPointCopyInsert
      DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K);  DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
      DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
      SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
      SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
      SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
    end do
  end do
end do
!$omp end parallel do
!OAT$ install LoopFusionSplit region end
• SplitPointCopyDef: the re-calculation is defined here.
• SplitPointCopyInsert: the place where the re-calculation is used is defined here.
• SplitPoint (K,J,I): the loop split point.
Candidates of Auto-generated Codes
• #1 [Baseline]: Original three-nested loop.
• #2 [Split]: Loop split for the k-loop (two separate three-nested loops).
• #3 [Split]: Loop split for the j-loop.
• #4 [Split]: Loop split for the i-loop.
• #5 [Fusion]: Loop fusion for the k-loop and j-loop (a two-nested loop).
• #6 [Split and Fusion]: Loop fusion for the k-loop and j-loop, applied to the loops in #2.
• #7 [Fusion]: Loop fusion for the k-loop, j-loop, and i-loop (loop collapse).
• #8 [Split and Fusion]: Loop fusion for the k-loop, j-loop, and i-loop, applied to the loops in #2 (a loop collapse for each of the two separated loops).
PERFORMANCE EVALUATION
An Example of Seism3D Simulation
• Earthquake in the western part of Tottori Prefecture, Japan, in 2000 ([1], p. 14).
• The region of 820 km x 410 km x 128 km is discretized with a grid spacing of 0.4 km.
• NX x NY x NZ = 2050 x 1025 x 320 ≒ 6.4 : 3.2 : 1.
[1] T. Furumura, “Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking”, Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009. In Japanese.
Figure: Seismic wave propagation in the earthquake in the western part of Tottori Prefecture, Japan. (a) Measured waves; (b) simulation results (reference: [1], p. 13).
Test Condition
• Software version
◦ ppOpen-APPL/FDM version 0.2
◦ ppOpen-AT version 0.2
• Target kernels in ppOpen-APPL/FDM
◦ Top 10 kernels (all three-nested loops)
▪ Update_stress
▪ Update_vel
▪ Update_sponge
▪ Other 7 kernels in finite difference computations.
• AT timing
◦ Before execute-time
▪ After the user fixes the problem size and the number of threads.
▪ AT is then applied when the library routine is called.
▪ All AT candidates are evaluated (brute-force search).
◦ Only 8+3+6+7*3 = 38 candidates.
• Number of repeats for each kernel in the AT mode
◦ 100 times
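The brute-force search described above (run every candidate a fixed number of times and keep the fastest) can be sketched as follows. This is a minimal illustration in C, not the ppOpen-AT implementation: the names `kernel_fn`, `time_candidate`, `argmin_time`, and `brute_force_search` are hypothetical, the two dummy kernels stand in for real candidates, and `clock()` measures CPU time, so a threaded kernel would need a wall-clock timer such as OpenMP's `omp_get_wtime()` instead.

```c
#include <assert.h>
#include <time.h>

typedef void (*kernel_fn)(void);

/* Dummy candidates for illustration; the volatile sink keeps the
   loops from being optimized away. */
static volatile double sink;
void cand_short(void) { for (int i = 0; i < 1000; i++) sink += i; }
void cand_long(void)  { for (int i = 0; i < 100000; i++) sink += i; }

/* Time `repeats` calls of one candidate; returns CPU seconds. */
double time_candidate(kernel_fn fn, int repeats)
{
    clock_t t0 = clock();
    for (int r = 0; r < repeats; r++)
        fn();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Index of the smallest measured time (the first wins on ties). */
int argmin_time(const double *t, int n)
{
    assert(n > 0);
    int best = 0;
    for (int c = 1; c < n; c++)
        if (t[c] < t[best])
            best = c;
    return best;
}

/* Brute force: measure every candidate and return the index of
   the fastest one; the measured times are stored in `times`. */
int brute_force_search(const kernel_fn *cands, int n, int repeats,
                       double *times)
{
    for (int c = 0; c < n; c++)
        times[c] = time_candidate(cands[c], repeats);
    return argmin_time(times, n);
}
```

With only 38 candidates in total and 100 repeats each, an exhaustive search like this stays cheap, which is presumably why no search heuristic is needed at this scale.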
The Xeon Phi Cluster System
• Intel Xeon (Ivy Bridge): host CPU
◦ OS: Red Hat Enterprise Linux Server release 6.2
◦ Number of nodes: 32 (available: 14 nodes)
◦ CPU: Intel Xeon E5-2670 v2 @ 2.50 GHz, 2 sockets x 10 cores
◦ Hyper-Threading: ON
◦ Theoretical peak performance for 1 node of CPU: 400 GFLOPS
◦ Memory size on 1 node: 64 GB
◦ Interconnect: InfiniBand
◦ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
◦ Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel
◦ KMP_AFFINITY=granularity=fine,compact (all threads are on socket)
• Intel Xeon Phi co-processor (Xeon Phi): accelerator
◦ CPU: Xeon Phi 5110P (B1 stepping), 1.053 GHz, 60 cores
◦ Memory size: 8 GB
◦ Theoretical peak performance: 1 TFLOPS (= 1.053 GHz x 16 FLOPS x 60 cores)
◦ One board is connected to each node of the cluster.
◦ Native mode
◦ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
◦ Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel -mmiccl -align array64byte
◦ KMP_AFFINITY=granularity=fine,balanced (all threads are equally distributed on socket)
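The peak figures above follow from clock rate x FLOPs per cycle per core x number of cores. The small C sketch below reproduces them; the function name `peak_gflops` is made up here, and the value of 8 FLOPs per cycle for the host CPU is not stated on the slide but is what the quoted 400 GFLOPS implies for 2.5 GHz and 20 cores.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS:
   clock [GHz] x FLOPs per cycle per core x number of cores. */
double peak_gflops(double ghz, int flops_per_cycle, int cores)
{
    assert(ghz > 0.0 && flops_per_cycle > 0 && cores > 0);
    return ghz * flops_per_cycle * cores;
}
```

For the Xeon Phi, `peak_gflops(1.053, 16, 60)` gives about 1010.88 GFLOPS, which the slide rounds to 1 TFLOPS. For the host node, `peak_gflops(2.5, 8, 20)` gives 400 GFLOPS.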
RESULTS ON THE XEON PHI
Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• Target problem size
– NX * NY * NZ = 256 x 96 x 100 / node
– NX * NY * NZ = 32 x 16 x 20 / core (!= per MPI process)
• Native mode for MIC
• Target MPI processes and threads on the Xeon Phi
– 1 node of the Xeon Phi with 4 HT (Hyper-Threading)
– PXTY: X MPI processes and Y threads per process
– P240T1: pure MPI with 4 HT per core
– P120T2
– P60T4
– P16T15
– P8T30: the hybrid MPI-OpenMP execution with the fewest MPI processes, since ppOpen-APPL/FDM needs at least 8 MPI processes.
• The number of iterations for the kernels: 100
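The two target problem sizes above can be cross-checked against each other: tiling the per-node grid with the smaller block gives the number of blocks per node. A minimal sketch in C, with the hypothetical names `Grid` and `blocks_per_node`:

```c
#include <assert.h>

typedef struct { int nx, ny, nz; } Grid;

/* Number of sub-blocks obtained when `node` is tiled by `block`.
   Returns -1 if the block does not divide the node grid evenly. */
long blocks_per_node(Grid node, Grid block)
{
    assert(block.nx > 0 && block.ny > 0 && block.nz > 0);
    if (node.nx % block.nx || node.ny % block.ny || node.nz % block.nz)
        return -1;
    return (long)(node.nx / block.nx) * (node.ny / block.ny)
                                      * (node.nz / block.nz);
}
```

For the sizes on the slide, 256 x 96 x 100 tiled by 32 x 16 x 20 gives 8 x 6 x 5 = 240 blocks. That matches the 240 hardware threads (60 cores x 4 HT) rather than the 60 physical cores, so the "/ core" figure appears to count each hardware thread as a core; this reading is an inference from the numbers, not something the slide states.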
AT Effect (update_stress, Xeon Phi), execution time in seconds
Conditions: KMP_AFFINITY=balanced, -align array64byte, new kernels

Configuration   Without AT [s]   With AT [s]   Speedup   Best SW
P240T1          2.11             1.29          1.63      6
P120T2          2.32             1.70          1.36      5
P60T4           2.33             1.74          1.34      5
P16T15          2.96             1.91          1.55      5
P8T30           3.14             1.97          1.59      6
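The speedups reported above are the ratio of the time without AT to the time with AT. The following C sketch recomputes them from the rounded times in the chart; the function name `speedups` is an assumption of this example.

```c
#include <assert.h>

/* Fill out[c] = without_at[c] / with_at[c] for n configurations. */
void speedups(const double *without_at, const double *with_at, int n,
              double *out)
{
    for (int c = 0; c < n; c++) {
        assert(with_at[c] > 0.0);
        out[c] = without_at[c] / with_at[c];
    }
}
```

Recomputed from the two-decimal times, the ratios agree with the reported speedups to within 0.01. The P240T1 case comes out as roughly 1.636 rather than 1.63, which would be consistent with the slide having used unrounded timings, although the source does not say so.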
Conclusion
• Loop fusion to obtain high parallelism is one of the key techniques for current multi- and many-core architectures.
◦ Execution with 240 threads/MPI process on the Xeon Phi.
◦ Strong scaling with more than 10,000 cores on the FX10.
• To do AT on supercomputers in operation, minimizing the “software-stack” requirements is a practical way to establish AT.
ppOpen-AT is free software!
ppOpen-AT version 0.2 is available!
It is distributed under the MIT license.
Please access the following page:
http://ppopenhpc.cc.u-tokyo.ac.jp/
Thank you for your attention!
Questions?

 
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler OptimizationsPragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
 
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization and
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization andIaetsd fpga implementation of cordic algorithm for pipelined fft realization and
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization and
 
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
 
PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...
PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...
PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...
 
Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...
Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...
Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...
 
design-compiler.pdf
design-compiler.pdfdesign-compiler.pdf
design-compiler.pdf
 
A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...
A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...
A Novel Route Optimized Cluster Based Routing Protocol for Pollution Controll...
 
Iaetsd finger print recognition by cordic algorithm and pipelined fft
Iaetsd finger print recognition by cordic algorithm and pipelined fftIaetsd finger print recognition by cordic algorithm and pipelined fft
Iaetsd finger print recognition by cordic algorithm and pipelined fft
 
Profiling PyTorch for Efficiency & Sustainability
Profiling PyTorch for Efficiency & SustainabilityProfiling PyTorch for Efficiency & Sustainability
Profiling PyTorch for Efficiency & Sustainability
 
My paper
My paperMy paper
My paper
 
TLD Anycast DNS servers to ISPs
TLD Anycast DNS servers to ISPsTLD Anycast DNS servers to ISPs
TLD Anycast DNS servers to ISPs
 
Times Series Feature Extraction Methods of Wearable Signal Data for Deep Lear...
Times Series Feature Extraction Methods of Wearable Signal Data for Deep Lear...Times Series Feature Extraction Methods of Wearable Signal Data for Deep Lear...
Times Series Feature Extraction Methods of Wearable Signal Data for Deep Lear...
 
Macro
MacroMacro
Macro
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Stranger in These Parts. A Hired Gun in the JS Corral (JSConf US 2012)
Stranger in These Parts. A Hired Gun in the JS Corral (JSConf US 2012)Stranger in These Parts. A Hired Gun in the JS Corral (JSConf US 2012)
Stranger in These Parts. A Hired Gun in the JS Corral (JSConf US 2012)
 
Dimensioning of IP Backbone
Dimensioning of IP BackboneDimensioning of IP Backbone
Dimensioning of IP Backbone
 
Georgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityGeorgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software security
 

Recently uploaded

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Recently uploaded (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER approach and minimizing software-stack requirements -

  • 1. Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER approach and minimizing software-stack requirements - Takahiro Katagiri (片桐 孝洋), Information Technology Center, The University of Tokyo (東京大学 情報基盤センター). 2014 ATAT in HPSC, National Taiwan University, March 15, 2014 (Saturday), Performance 10:10-10:30. Joint work with: Satoshi Ohshima (大島 聡史), Masaharu Matsumoto (松本 正晴)
  • 2. Overview 1. Background and ppOpen-HPC Project 2. ppOpen-AT Basics 3. Adaptation to an FDM Application 4. Performance Evaluation 5. Conclusion
  • 3. Overview 1. Background and ppOpen-HPC Project 2. ppOpen-AT Basics 3. Adaptation to an FDM Application 4. Performance Evaluation 5. Conclusion
  • 4. Background
       High-Thread Parallelism (HTP)
        ◦ Multi-core and many-core processors are pervasive.
             Multicore CPUs: 8-16 cores, 16-64 threads with Hyper-Threading (HT) or Simultaneous Multithreading (SMT).
             Many-core CPU: Xeon Phi, 60 cores, 240 threads with HT.
        ◦ Utilizing parallelism with all available threads is important.
       Performance Portability (PP)
        ◦ Keeping high performance across multiple computer environments.
             Not only multiple CPUs, but also multiple compilers.
             Run-time information, such as loop length and number of threads, is important.
        ◦ Auto-tuning (AT) is one of the candidate technologies for establishing PP across multiple computer environments.
  • 5. ppOpen-HPC Project
       Middleware for HPC and Its AT
        ◦ Supported by JST, CREST, from FY2011 to FY2016.
        ◦ PI: Professor Kengo Nakajima (U. Tokyo)
       ppOpen-HPC
        ◦ An open source infrastructure for reliable simulation codes on post-peta (pp) scale parallel computers.
        ◦ Consists of various types of libraries, which cover 5 kinds of discretization methods for scientific computations.
       ppOpen-AT
        ◦ An auto-tuning language for ppOpen-HPC codes.
        ◦ Builds on knowledge from a previous project, the ABCLibScript Project.
        ◦ An auto-tuning language based on AT directives.
  • 6. Software Architecture of ppOpen-HPC [Figure: layered diagram. User's Program; ppOpen-APPL (FEM, FDM, FVM, BEM, DEM); ppOpen-MATH (MG, GRAPH, VIS, MP); ppOpen-AT (STATIC, DYNAMIC) with an Auto-Tuning Facility (code generation for optimization candidates, search for the best candidate, automatic execution of the optimization); ppOpen-SYS (COMM, FT); Resource Allocation Facility (specifies the best execution allocations); target hardware: many-core CPUs, GPU, low-power CPUs, vector CPUs.]
  • 7. Overview 1. Background and ppOpen-HPC Project 2. ppOpen-AT Basics 3. Adaptation to an FDM Application 4. Performance Evaluation 5. Conclusion
  • 9. A Scenario to Software Developers for ppOpen-AT
      [Figure: the software developer writes a description of AT by using ppOpen-AT; invoking the dedicated preprocessor produces a program with AT functions, which becomes executable code with optimization candidates and an AT function.]
      Description by the software developer (optimization that cannot be established by compilers):
        #pragma oat install unroll (i,j,k) region start
        #pragma oat varied (i,j,k) from 1 to 8
        for(i = 0 ; i < n ; i++){
          for(j = 0 ; j < n ; j++){
            for(k = 0 ; k < n ; k++){
              A[i][j]=A[i][j]+B[i][k]*C[k][j];
        }}}
        #pragma oat install unroll (i,j,k) region end
      ■ Automatically generated functions: optimization candidates, performance monitor, parameter search, performance modeling.
      Optimizations for source codes, computer resources, and power consumption.
  • 10. Compiler Optimization and AT
      1. Loop length is unknown at compile time.
           The optimal loop split and loop fusion can only be determined at run time.
           Run-time compilation is available only in research.
      2. Loop split with data dependencies.
           Some loop splits require additional computations or memory space.
           Some compilers provide a directive, but the directive is not standardized.
           Code optimization is not standardized between compilers either.
      3. Restrictions from operation in supercomputers.
           Some supercomputer environments cannot supply the required "software-stack", or the software-stack cannot be utilized due to operational restrictions.
           Some software does not target the system due to hardware restrictions. Ex) CAPS in the K-computer.
           Operation costs (budgets), vendor strategy, etc.
  • 11. Overview 1. Background and ppOpen-HPC Project 2. ppOpen-AT Basics 3. Adaptation to an FDM Application 4. Performance Evaluation 5. Conclusion
  • 12. EARLY EXPERIENCE IN EXPLICIT METHOD (FINITE DIFFERENCE METHOD)
  • 13. Target Application
       Seism3D: Simulation software for seismic wave analysis.
           Strategic simulation software in Japan.
           Developed by Professor Furumura at the University of Tokyo.
            ◦ The code is re-constructed as ppOpen-APPL/FDM.
       Finite Difference Method (FDM)
       3D simulation
        ◦ 3D arrays are allocated.
       Data type: Single Precision (real*4)
      Source: http://www.eri.u-tokyo.ac.jp/furumura/tsunami/tsunami.html
  • 14. The Heaviest Loop (20%+ to Total Time)
        !$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
        DO K = 1, NZ
        DO J = 1, NY
        DO I = 1, NX
          RL1 = LAM (I,J,K)
          RM1 = RIG (I,J,K)
          RM2 = RM1 + RM1; RLRM2 = RL1+RM2
          DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
          DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
          SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
          SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
          SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
          DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
          DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K); DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
          SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
          SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
          SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
        END DO
        END DO
        END DO
        !$omp end parallel do
      A flow dependency: RM1, defined near the top of the loop body, is used again by the SXY, SXZ, and SYZ updates.
  • 15. Optimization Possibilities
       Loop Splitting
        ◦ To reduce spill code.
        ◦ To maximize register usage.
       Loop fusion (Loop Collapse)
        ◦ 3-nested loop -> the following two approaches.
        ◦ One-nested loop
             To increase outer loop parallelism for thread parallelism.
        ◦ Two-nested loop
             To increase outer loop parallelism for thread parallelism.
             To utilize pre-fetching for the inner loop.
  • 16. Loop fusion: One dimensional (a loop collapse)
        !$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
        DO KK = 1, NZ * NY * NX
          K = (KK-1)/(NY*NX) + 1
          J = mod((KK-1)/NX,NY) + 1
          I = mod(KK-1,NX) + 1
          RL1 = LAM (I,J,K)
          RM1 = RIG (I,J,K)
          RM2 = RM1 + RM1; RLRM2 = RL1+RM2
          DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
          DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
          SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
          SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
          SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
          DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
          DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K); DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
          SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
          SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
          SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
        END DO
        !$omp end parallel do
      Merit: The loop length is huge (NZ * NY * NX), which is good for OpenMP thread parallelism.
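The index arithmetic on slide 16 can be checked in isolation. The following is a minimal C sketch, not code from ppOpen-AT: the function names `decode_collapsed` and `collapse_matches_nested` are hypothetical, and the 1-based indexing simply mirrors the Fortran on the slide. It recovers (I, J, K) from the collapsed index KK and confirms that the collapsed loop visits the same triples in the same order as the original three-nested loop.

```c
#include <assert.h>

/* Recover 1-based (i, j, k) from a 1-based collapsed index kk, following
   the formulas on the slide:
     K = (KK-1)/(NY*NX) + 1
     J = mod((KK-1)/NX, NY) + 1
     I = mod(KK-1, NX) + 1                                               */
static void decode_collapsed(long kk, int nx, int ny, int *i, int *j, int *k)
{
    assert(kk >= 1 && nx > 0 && ny > 0);
    long t = kk - 1;                         /* switch to 0-based */
    *k = (int)(t / ((long)ny * nx)) + 1;
    *j = (int)((t / nx) % ny) + 1;
    *i = (int)(t % nx) + 1;
}

/* Hypothetical self-check: returns 1 if decoding kk = 1..nx*ny*nz yields
   exactly the (i, j, k) sequence of the nested K/J/I loops, else 0.     */
static int collapse_matches_nested(int nx, int ny, int nz)
{
    long kk = 1;
    for (int k = 1; k <= nz; k++)
        for (int j = 1; j <= ny; j++)
            for (int i = 1; i <= nx; i++, kk++) {
                int ci, cj, ck;
                decode_collapsed(kk, nx, ny, &ci, &cj, &ck);
                if (ci != i || cj != j || ck != k)
                    return 0;
            }
    return kk - 1 == (long)nx * ny * nz;
}
```

The decode costs two integer divisions and two remainders per iteration, which is the price paid for the longer parallel loop; whether that trade pays off is one of the things the auto-tuner is meant to measure.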
  • 17. Loop fusion: Two dimensional
        !$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
        DO KK = 1, NZ * NY
          K = (KK-1)/NY + 1
          J = mod(KK-1,NY) + 1
          DO I = 1, NX
            RL1 = LAM (I,J,K)
            RM1 = RIG (I,J,K)
            RM2 = RM1 + RM1; RLRM2 = RL1+RM2
            DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
            DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
            SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
            SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
            SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
            DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
            DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K); DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
            SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
            SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
            SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
          END DO
        END DO
        !$omp end parallel do
      Merit: The outer loop length is huge (NZ * NY), which is good for OpenMP thread parallelism. The inner I-loop gives an opportunity for pre-fetching.
  • 18. Loop split with re-computation
        !$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
        DO K = 1, NZ
        DO J = 1, NY
          DO I = 1, NX
            RL1 = LAM (I,J,K)
            RM1 = RIG (I,J,K)
            RM2 = RM1 + RM1; RLRM2 = RL1+RM2
            DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
            DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
            SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
            SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
            SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
          END DO
          DO I = 1, NX
            RM1 = RIG (I,J,K)
            DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
            DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K); DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
            SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
            SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
            SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
          END DO
        END DO
        END DO
      Re-computation (a copy of RM1 = RIG (I,J,K)) is needed in the second loop. ⇒ Compilers do not apply this split without a directive.
      Perfect Splitting: Two 3-nested Loops
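The split on slide 18 works because the value carried across the split point (RM1) is cheap to reload. A reduced C sketch of the same idea follows; it is an illustration under assumptions, not the Seism3D kernel: the arrays are one-dimensional, the update formulas are simplified, and every name (`update_fused`, `update_split`, `split_equals_fused`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fused form: rm1 is loaded once and feeds both updates. */
static void update_fused(size_t n, const float *rig, const float *d1,
                         const float *d2, float *s1, float *s2, float dt)
{
    for (size_t i = 0; i < n; i++) {
        float rm1 = rig[i];
        s1[i] += (rm1 + rm1) * d1[i] * dt;
        s2[i] += rm1 * d2[i] * dt;
    }
}

/* Split form: the definition of rm1 is copied into the second loop
   (the "re-computation"), so neither loop depends on the other.     */
static void update_split(size_t n, const float *rig, const float *d1,
                         const float *d2, float *s1, float *s2, float dt)
{
    for (size_t i = 0; i < n; i++) {
        float rm1 = rig[i];
        s1[i] += (rm1 + rm1) * d1[i] * dt;
    }
    for (size_t i = 0; i < n; i++) {
        float rm1 = rig[i];              /* copied definition */
        s2[i] += rm1 * d2[i] * dt;
    }
}

/* Hypothetical self-check on exactly representable inputs:
   returns 1 if both forms produce identical results.        */
static int split_equals_fused(void)
{
    const float rig[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float d1[4]  = {1.0f, 1.0f, 2.0f, 2.0f};
    const float d2[4]  = {3.0f, 1.0f, 0.0f, 2.0f};
    float a1[4] = {0}, a2[4] = {0}, b1[4] = {0}, b2[4] = {0};
    update_fused(4, rig, d1, d2, a1, a2, 0.5f);
    update_split(4, rig, d1, d2, b1, b2, 0.5f);
    for (size_t i = 0; i < 4; i++)
        if (a1[i] != b1[i] || a2[i] != b2[i])
            return 0;
    return 1;
}
```

The split form reads `rig` twice, which is the extra memory traffic the slide refers to; in exchange each loop body needs fewer live registers.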
  • 19. New Directives for ppOpen-AT
      • m_stress.f90 (ppohFDM_update_stress)
        !OAT$ install LoopFusionSplit region start
        !$omp parallel do private (k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
        do k = NZ00, NZ01
        do j = NY00, NY01
        do i = NX00, NX01
          RL1   = LAM (I,J,K)
        !OAT$ SplitPointCopyDef sub region start
          RM1   = RIG (I,J,K)
        !OAT$ SplitPointCopyDef sub region end
          RM2   = RM1 + RM1;  RLRM2 = RL1+RM2
          DXVX1 = DXVX(I,J,K);  DYVY1 = DYVY(I,J,K);  DZVZ1 = DZVZ(I,J,K)
          D3V3  = DXVX1 + DYVY1 + DZVZ1
          SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
          SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
          SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
        !OAT$ SplitPoint (K,J,I)
        !OAT$ SplitPointCopyInsert
          DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K);  DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
          DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
          SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
          SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
          SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
        end do
        end do
        end do
        !$omp end parallel do
        !OAT$ install LoopFusionSplit region end
      SplitPointCopyDef: the re-calculation is defined here. SplitPointCopyInsert: the re-calculation is used here. SplitPoint (K,J,I): the loop split point.
  • 20. Candidates of Auto-generated Codes
       #1 [Baseline]: Original three-nested loop.
       #2 [Split]: Loop split for the k-loop (two separate three-nested loops).
       #3 [Split]: Loop split for the j-loop.
       #4 [Split]: Loop split for the i-loop.
       #5 [Fusion]: Loop fusion for the k-loop and j-loop (a two-nested loop).
       #6 [Split and Fusion]: Loop fusions for the k-loop and j-loop for the loops in #2.
       #7 [Fusion]: Loop fusions for the k-loop, j-loop, and i-loop (loop collapse).
       #8 [Split and Fusion]: Loop fusions for the k-loop, j-loop, and i-loop for the loops in #2 (loop collapses for the two separated loops).
  • 21. Overview 1. Background and ppOpen-HPC Project 2. ppOpen-AT Basics 3. Adaptation to an FDM Application 4. Performance Evaluation 5. Conclusion
  • 23. An Example of Seism3D Simulation
       The 2000 earthquake in the western part of Tottori prefecture, Japan. ([1], p.14)
       The region of 820 km x 410 km x 128 km is discretized with a 0.4 km grid.
       NX x NY x NZ = 2050 x 1025 x 320 ≒ 6.4 : 3.2 : 1.
      Figure: Seismic wave translations in the earthquake in the western part of Tottori prefecture, Japan. (a) Measured waves; (b) Simulation results. (Reference: [1], p.13)
      [1] T. Furumura, "Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking", Supercomputing News, Information Technology Center, The University of Tokyo, Vol.11, Special Edition 1, 2009. In Japanese.
  • 24. Test Condition
       Software version
        ◦ ppOpen-APPL/FDM version 0.2
        ◦ ppOpen-AT version 0.2
       Target Kernels in ppOpen-APPL/FDM
        ◦ Top 10 kernels (all three-nested loops)
             Update_stress
             Update_vel
             Update_spong
             Other 7 kernels in finite difference computations.
       AT Timing
        ◦ Before execute-time
             After the user fixes the problem size and the number of threads.
             Then, AT is adapted in time for the call of the library routine.
             All candidates of AT are evaluated (brute-force search).
            ◦ Only 8+3+6+7*3 = 38 candidates.
       #Repeats for each kernel in the AT mode
        ◦ 100 times
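A brute-force search like the one on slide 24 amounts to timing every candidate and keeping the fastest, which stays affordable with only 38 candidates in total. The C sketch below illustrates that selection step only. It is not the ppOpen-AT implementation: the two toy kernels, the `kernel_fn` type, and the functions `select_best`, `demo_select`, and `variants_agree` are all hypothetical, and `clock()` stands in for whatever timer a real run would use.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* A candidate is any function with this (hypothetical) signature. */
typedef void (*kernel_fn)(size_t n, float *a, const float *b);

/* Candidate 0: straightforward loop. */
static void kernel_baseline(size_t n, float *a, const float *b)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Candidate 1: same computation, unrolled by two with a remainder loop. */
static void kernel_unroll2(size_t n, float *a, const float *b)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
    for (; i < n; i++)
        a[i] += b[i];
}

/* Brute-force search: run every candidate `repeats` times and
   return the index of the one with the smallest elapsed time.  */
static size_t select_best(const kernel_fn *cands, size_t ncand, size_t n,
                          float *a, const float *b, int repeats)
{
    size_t best = 0;
    double best_t = -1.0;
    for (size_t c = 0; c < ncand; c++) {
        clock_t t0 = clock();
        for (int r = 0; r < repeats; r++)
            cands[c](n, a, b);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best_t < 0.0 || t < best_t) {
            best_t = t;
            best = c;
        }
    }
    return best;
}

/* Hypothetical driver: 100 repeats per candidate, as on the slide. */
static size_t demo_select(void)
{
    float a[64] = {0.0f}, b[64];
    for (size_t i = 0; i < 64; i++)
        b[i] = 1.0f;
    const kernel_fn cands[] = { kernel_baseline, kernel_unroll2 };
    return select_best(cands, 2, 64, a, b, 100);
}

/* Candidates must be interchangeable: returns 1 if both agree. */
static int variants_agree(void)
{
    const float b[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    float a1[5] = {0}, a2[5] = {0};
    kernel_baseline(5, a1, b);
    kernel_unroll2(5, a2, b);
    for (size_t i = 0; i < 5; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```

Which index wins depends on the machine and compiler, which is the point of tuning after the problem size and thread count are fixed.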
  • 25. The Xeon Phi Cluster System
       Intel Xeon (Ivy Bridge): HOST CPU
        ◦ OS: Red Hat Enterprise Linux Server release 6.2
        ◦ #Nodes: 32 (Available: 14 nodes)
        ◦ CPU: Intel Xeon E5-2670 V2 @ 2.50GHz, 2 sockets x 10 cores
        ◦ Hyper Threading: ON
        ◦ Theoretical Peak Performance for 1 node of CPU: 400 GFLOPS
        ◦ Memory size on 1 node: 64 GB
        ◦ Interconnect: Infiniband
        ◦ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
        ◦ Compiler Option: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel
        ◦ KMP_AFFINITY=granularity=fine, compact (all threads are on socket)
       Intel Xeon Phi co-processor (Xeon Phi): Accelerator
        ◦ CPU: Xeon Phi 5110P (B1 stepping) 1.053 GHz, 60 cores
        ◦ Memory size: 8 GB
        ◦ Theoretical Peak Performance: 1 TFLOPS (= 1.053 GHz x 16 FLOPS x 60 cores)
        ◦ One board is connected to each node of the cluster.
        ◦ Native mode
        ◦ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
        ◦ Compiler Option: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel -mmiccl -align array64byte
        ◦ KMP_AFFINITY=granularity=fine, balanced (all threads are equally distributed on socket)
  • 27. Execution Details
      • ppOpen-APPL/FDM ver.0.2
      • ppOpen-AT ver.0.2
      • Target Problem Size
          - NX * NY * NZ = 256 x 96 x 100 / node
          - NX * NY * NZ = 32 x 16 x 20 / core (!= per MPI Process)
      • Native mode for MIC
      • Target MPI Processes and Threads on the Xeon Phi
          - 1 node of the Xeon Phi with 4 HT (Hyper Threading)
          - PXTY: X MPI Processes and Y Threads per process
          - P240T1: pure MPI with 4 HT per core
          - P120T2
          - P60T4
          - P16T15
          - P8T30: Minimum Hybrid MPI-OpenMP execution for ppOpen-APPL/FDM, since it needs a minimum of 8 MPI Processes.
      • The number of iterations for the kernels: 100
  • 28. AT Effect (update_stress, Xeon Phi), New Kernels
      Settings: KMP_AFFINITY=balanced, -align array64byte
      Configuration   Without AT [s]   With AT [s]   Speedup
      P240T1          2.11             1.29          1.63
      P120T2          2.32             1.70          1.36
      P60T4           2.33             1.74          1.34
      P16T15          2.96             1.91          1.55
      P8T30           3.14             1.97          1.59
      Best SW, in the order listed on the slide: 6, 5, 5, 5, 6
  • 29. Conclusion
       Loop fusion to obtain high parallelism is one of the key techniques for current multi- and many-core architectures.
        ◦ Execution with 240 threads/MPI process on the Xeon Phi.
        ◦ Strong scaling with more than 10,000 cores on the FX10.
       To do AT on supercomputers in operation, minimizing the "software-stack" requirements is a practical way to establish AT.
  • 30. ppOpen-AT is free software! ppOpen-AT version 0.2 is available! It is licensed under the MIT License. Please access the following page: http://ppopenhpc.cc.u-tokyo.ac.jp/
  • 31. Thank you for your attention! Questions?