Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER approach and minimizing software-stack requirements -

Takahiro Katagiri (片桐孝洋)
Information Technology Center, The University of Tokyo
2014 ATAT in HPSC, National Taiwan University,
March 15, 2014 (Saturday), Performance 10:10-10:30
Joint work with: Satoshi Ohshima (大島聡史) and Masaharu Matsumoto (松本正晴)

Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion


Background
• High-Thread Parallelism (HTP)
◦ Multi-core and many-core processors are pervasive.
▪ Multi-core CPUs: 8-16 cores, 16-64 threads with Hyper-Threading (HT) or Simultaneous Multithreading (SMT).
▪ Many-core CPU: Xeon Phi, 60 cores, 240 threads with HT.
◦ Using all hardware threads in parallel is important.
• Performance Portability (PP)
◦ Keeping high performance across multiple computer environments.
▪ Not only multiple CPUs, but also multiple compilers.
▪ Run-time information, such as loop length and number of threads, is important.
◦ Auto-tuning (AT) is one of the candidate technologies for establishing PP across multiple computer environments.

ppOpen-HPC Project
• Middleware for HPC and Its AT
◦ Supported by JST, CREST, from FY2011 to FY2016.
◦ PI: Professor Kengo Nakajima (U. Tokyo)
• ppOpen-HPC
◦ An open source infrastructure for reliable simulation codes on post-peta (pp) scale parallel computers.
◦ Consists of various types of libraries, which cover 5 kinds of discretization methods for scientific computations.
• ppOpen-AT
◦ An auto-tuning language for ppOpen-HPC codes.
◦ Uses knowledge from a previous project, the ABCLibScript Project.
◦ The language is based on AT directives.

Software Architecture of ppOpen-HPC
[Figure: layered architecture. The user's program sits on top of ppOpen-APPL (FEM, FDM, FVM, BEM, DEM), ppOpen-MATH (MG, GRAPH, VIS, MP), ppOpen-AT (STATIC, DYNAMIC), and ppOpen-SYS (COMM, FT). The Auto-Tuning Facility covers code generation for optimization candidates, search for the best candidate, and automatic execution of the optimization. The Resource Allocation Facility specifies the best execution allocations. Target hardware: many-core CPUs, GPUs, low-power CPUs, and vector CPUs.]


Overview of FIBER (Framework of Install-time, Before Execute-time and Run-time Auto-tuning) [T. Katagiri et al., 03]
[Figure: FIBER workflow. Legacy codes with AT directives (#pragma oat ...) pass through the preprocessor of the AT directives, which produces legacy codes with AT functions and the AT candidates specified by the directives (#implementation1, #implementation2, #implementation3). Compiling these gives executable codes with AT functions. FIBER defines three AT timings: install-time, before execute-time, and run-time. The timings are specified by the AT directives. Other labels in the figure: FIBER auto-tuner, best parameters, performance database, API on FIBER, user specifies parameter.]

A Scenario for Software Developers Using ppOpen-AT
[Figure: developer workflow. The software developer describes AT by using ppOpen-AT, targeting optimizations that cannot be established by compilers. Invoking the dedicated preprocessor turns this description into a program with AT functions, and then into executable code with optimization candidates and the AT function.]
Description by the software developer (optimizations for source codes, computer resources, power consumption):
#pragma oat install unroll (i,j,k) region start
#pragma oat varied (i,j,k) from 1 to 8
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
      A[i][j] = A[i][j] + B[i][k] * C[k][j];
    }
  }
}
#pragma oat install unroll (i,j,k) region end
Automatically generated functions:
• Optimization candidates
• Performance monitor
• Parameter search
• Performance modeling
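
To make the unroll directive above more concrete, here is a hand-written sketch of what one optimization candidate might look like: the baseline triple loop and a variant with the i-loop unrolled by 2. The slides do not show the code that the ppOpen-AT preprocessor actually emits, so the function names, the flat row-major array layout, and the choice of unroll depth are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: the triple loop from the slide, on n x n row-major matrices. */
void matmul_baseline(size_t n, double *A, const double *B, const double *C)
{
    assert(A && B && C);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                A[i * n + j] += B[i * n + k] * C[k * n + j];
}

/* Hypothetical candidate: i-loop unrolled by 2, j and k left at depth 1. */
void matmul_unroll_i2(size_t n, double *A, const double *B, const double *C)
{
    assert(A && B && C);
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < n; k++) {
                A[i * n + j]       += B[i * n + k]       * C[k * n + j];
                A[(i + 1) * n + j] += B[(i + 1) * n + k] * C[k * n + j];
            }
        }
    }
    /* Remainder iteration when n is odd. */
    for (; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                A[i * n + j] += B[i * n + k] * C[k * n + j];
}

/* Returns 1 if both variants give identical results on a small odd-sized case. */
int unroll_candidates_agree(void)
{
    enum { N = 5 };
    double A1[N * N], A2[N * N], B[N * N], C[N * N];
    for (int t = 0; t < N * N; t++) {
        A1[t] = A2[t] = 0.0;
        B[t] = (double)(t % 7);
        C[t] = (double)(t % 3);
    }
    matmul_baseline(N, A1, B, C);
    matmul_unroll_i2(N, A2, B, C);
    for (int t = 0; t < N * N; t++)
        if (A1[t] != A2[t]) return 0;
    return 1;
}
```

With `varied (i,j,k) from 1 to 8`, an auto-tuner would presumably generate one such variant per combination of unroll depths and pick the fastest. The point of the sketch is only that every candidate must compute the same result as the baseline.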

Compiler Optimization and AT
1. Loop length is unknown at compile time.
• The optimal loop split and loop fusion can only be determined at run time.
• Run-time compilation is still at the research stage.
2. Loop split with data dependencies.
• Some loop splits require additional computation or memory space.
• Some compilers provide a directive for this, but the directive is not standardized.
• Code optimization is not standardized across compilers either.
3. Restrictions from operation of supercomputers.
• Some supercomputer environments cannot supply the required "software stack", or the software stack cannot be used due to operational restrictions.
▪ The system is out of the software's target due to hardware restrictions.
▪ Ex.) CAPS on the K computer.
▪ Operation costs (budgets), vendor strategy, etc.


EARLY EXPERIENCE IN EXPLICIT METHOD (FINITE DIFFERENCE METHOD)

Target Application
• Seism3D: Simulation software for seismic wave analysis.
▪ Strategic simulation software in Japan.
▪ Developed by Professor Furumura at the University of Tokyo.
◦ The code is re-constructed as ppOpen-APPL/FDM.
▪ Finite Difference Method (FDM)
▪ 3D simulation
◦ 3D arrays are allocated.
▪ Data type: single precision (real*4)
Source: http://www.eri.u-tokyo.ac.jp/furumura/tsunami/tsunami.html

The Heaviest Loop (20%+ of Total Time)
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1, &
!$omp&                    DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL1 = LAM (I,J,K)
      RM1 = RIG (I,J,K)
      RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
      DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
      DZVZ1 = DZVZ(I,J,K); D3V3 = DXVX1 + DYVY1 + DZVZ1
      SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
      SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
      SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
      DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
      DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
      DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
      SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
      SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
      SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
    END DO
  END DO
END DO
!$omp end parallel do
A flow dependency: RM1 is defined near the top of the loop body and used again in the SXY, SXZ, and SYZ updates.

Optimization Possibilities
• Loop Splitting
◦ To reduce spill code.
◦ To maximize register usage.
• Loop Fusion (Loop Collapse)
◦ The three-nested loop can be fused in the following two ways.
◦ One-nested loop
▪ To increase outer-loop parallelism for thread parallelism.
◦ Two-nested loop
▪ To increase outer-loop parallelism for thread parallelism.
▪ To utilize pre-fetching for the inner loop.

Loop Fusion: One-dimensional (a loop collapse)
!$omp parallel do private(..., DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY * NX
  K = (KK-1)/(NY*NX) + 1
  J = mod((KK-1)/NX,NY) + 1
  I = mod(KK-1,NX) + 1
  RL1 = LAM (I,J,K)
  RM1 = RIG (I,J,K)
  ...
END DO
Merit: The loop length is huge (NZ * NY * NX), which is good for OpenMP thread parallelism.

Loop Fusion: Two-dimensional
• Example:
!$omp parallel do private(..., DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY
  K = (KK-1)/NY + 1
  J = mod(KK-1,NY) + 1
  DO I = 1, NX
    RL1 = LAM (I,J,K)
    RM1 = RIG (I,J,K)
    ...
  END DO
END DO
Merit: The outer loop length is huge (NZ * NY), which is good for OpenMP thread parallelism. The remaining inner I-loop gives an opportunity for pre-fetching.
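
The index recovery used by the two fused forms can be checked in isolation. The sketch below ports the K, J, I formulas from the slides to C (keeping the 1-based indices of the Fortran code) and verifies that the fused loops visit the same index tuples, in the same order, as the original nest. The function names are hypothetical and are not part of ppOpen-AT.

```c
#include <assert.h>

/* One-dimensional fusion: recover 1-based (i, j, k) from the collapsed
   1-based index kk, which runs over 1 .. nz*ny*nx. */
void unfuse3(int kk, int nx, int ny, int *i, int *j, int *k)
{
    assert(kk >= 1 && nx > 0 && ny > 0);
    *k = (kk - 1) / (ny * nx) + 1;
    *j = ((kk - 1) / nx) % ny + 1;
    *i = (kk - 1) % nx + 1;
}

/* Two-dimensional fusion: recover 1-based (j, k) from kk in 1 .. nz*ny.
   The i-loop stays as an ordinary inner loop. */
void unfuse2(int kk, int ny, int *j, int *k)
{
    assert(kk >= 1 && ny > 0);
    *k = (kk - 1) / ny + 1;
    *j = (kk - 1) % ny + 1;
}

/* Returns 1 if both fused forms enumerate indices in the same order
   as the original three-nested K, J, I loop. */
int fusion_matches_nest(int nx, int ny, int nz)
{
    int kk3 = 0, kk2 = 0;
    for (int k = 1; k <= nz; k++) {
        for (int j = 1; j <= ny; j++) {
            int j2, k2;
            unfuse2(++kk2, ny, &j2, &k2);
            if (j2 != j || k2 != k) return 0;
            for (int i = 1; i <= nx; i++) {
                int i3, j3, k3;
                unfuse3(++kk3, nx, ny, &i3, &j3, &k3);
                if (i3 != i || j3 != j || k3 != k) return 0;
            }
        }
    }
    return kk3 == nx * ny * nz && kk2 == ny * nz;
}
```

The integer divisions and remainders are the price of the longer parallel loop. Whether that cost pays off depends on the machine and the thread count, which is why both fused forms are kept as AT candidates rather than chosen up front.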

Perfect Splitting: Two 3-nested Loops
!$omp parallel do private(..., DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
  DO J = 1, NY
    DO I = 1, NX
      RL1 = LAM (I,J,K)
      RM1 = RIG (I,J,K)
      ...
    END DO
    DO I = 1, NX
      RM1 = RIG (I,J,K)
      ...
    END DO
  END DO
END DO
Re-computation (a copy of the RM1 definition) is needed in the second loop.
⇒ Compilers do not apply this split without a directive.
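
The need for re-computation can be shown with a much smaller stand-in. The sketch below is not the Seism3D kernel: it uses one-dimensional arrays and two simplified updates, and all names are made up for illustration. It only demonstrates the pattern, namely that the second loop must repeat the definition of `rm1` after the split, and that the split form then gives the same result as the fused form.

```c
#include <assert.h>
#include <stddef.h>

/* Fused form: rm1 is defined once and used by both updates. */
void update_fused(size_t n, const float *rig, const float *dv,
                  const float *ds, float dt, float *sxx, float *sxy)
{
    assert(rig && dv && ds && sxx && sxy);
    for (size_t i = 0; i < n; i++) {
        float rm1 = rig[i];
        sxx[i] += (rm1 + rm1) * dv[i] * dt;
        sxy[i] += rm1 * ds[i] * dt;
    }
}

/* Split form: the second loop carries a copy of the rm1 definition,
   because the value computed in the first loop is no longer in scope. */
void update_split(size_t n, const float *rig, const float *dv,
                  const float *ds, float dt, float *sxx, float *sxy)
{
    assert(rig && dv && ds && sxx && sxy);
    for (size_t i = 0; i < n; i++) {
        float rm1 = rig[i];
        sxx[i] += (rm1 + rm1) * dv[i] * dt;
    }
    for (size_t i = 0; i < n; i++) {
        float rm1 = rig[i];            /* re-computation (copy) */
        sxy[i] += rm1 * ds[i] * dt;
    }
}

/* Returns 1 if the split form reproduces the fused form exactly.
   Inputs are small integers and dt = 0.5, so all products are exact. */
int split_matches_fused(void)
{
    enum { N = 8 };
    float rig[N], dv[N], ds[N], a1[N], a2[N], b1[N], b2[N];
    for (int i = 0; i < N; i++) {
        rig[i] = (float)(i + 1);
        dv[i] = (float)(i % 3);
        ds[i] = (float)(i % 5);
        a1[i] = a2[i] = 1.0f;
        b1[i] = b2[i] = 2.0f;
    }
    update_fused(N, rig, dv, ds, 0.5f, a1, b1);
    update_split(N, rig, dv, ds, 0.5f, a2, b2);
    for (int i = 0; i < N; i++)
        if (a1[i] != a2[i] || b1[i] != b2[i]) return 0;
    return 1;
}
```

The split trades an extra load of `rig[i]` for smaller loop bodies, which is the register-pressure argument made on the "Optimization Possibilities" slide.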

New Directives for ppOpen-AT
• m_stress.f90 (ppohFDM_update_stress)
!OAT$ install LoopFusionSplit region start
!$omp parallel do private (k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
do k = NZ00, NZ01
  do j = NY00, NY01
    do i = NX00, NX01
      RL1 = LAM (I,J,K)
!OAT$ SplitPointCopyDef sub region start
      RM1 = RIG (I,J,K)
!OAT$ SplitPointCopyDef sub region end
      DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
      D3V3 = DXVX1 + DYVY1 + DZVZ1
      SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)-RM2*(DZVZ1+DYVY1) ) * DT
      SYY (I,J,K) = SYY (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DZVZ1) ) * DT
      SZZ (I,J,K) = SZZ (I,J,K) + (RLRM2*(D3V3)-RM2*(DXVX1+DYVY1) ) * DT
!OAT$ SplitPoint (K,J,I)
!OAT$ SplitPointCopyInsert
      DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K); DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
    end do
  end do
end do
!OAT$ install LoopFusionSplit region end
• SplitPointCopyDef sub region: the re-calculation is defined here.
• SplitPoint (K,J,I): the loop split point.
• SplitPointCopyInsert: the re-calculation defined above is inserted here.

Candidates of Auto-generated Codes
• #1 [Baseline]: Original three-nested loop.
• #2 [Split]: Loop split for the k-loop (two separate three-nested loops).
• #3 [Split]: Loop split for the j-loop.
• #4 [Split]: Loop split for the i-loop.
• #5 [Fusion]: Loop fusion for the k-loop and j-loop (a two-nested loop).
• #6 [Split and Fusion]: Loop fusion for the k-loop and j-loop, applied to the loops in #2.
• #7 [Fusion]: Loop fusion for the k-loop, j-loop, and i-loop (loop collapse).
• #8 [Split and Fusion]: Loop fusion for the k-loop, j-loop, and i-loop, applied to the loops in #2 (loop collapses for the two separated loops).


An Example of Seism3D Simulation
• The 2000 Western Tottori earthquake in Japan ([1], p. 14).
• A region of 820 km x 410 km x 128 km is discretized with a 0.4 km grid spacing.
• NX x NY x NZ = 2050 x 1025 x 320, a ratio of about 6.4 : 3.2 : 1.
[1] T. Furumura, "Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking", Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009. In Japanese.
Figure: Seismic wave propagation in the 2000 Western Tottori earthquake in Japan. (a) Measured waves; (b) simulation results. (Reference: [1], p. 13)

Test Condition
• Software version
◦ ppOpen-APPL/FDM version 0.2
◦ ppOpen-AT version 0.2
• Target kernels in ppOpen-APPL/FDM
◦ Top 10 kernels (all three-nested loops)
▪ Update_stress
▪ Update_vel
▪ Update_spong
▪ 7 other kernels in the finite difference computations.
• AT timing
◦ Before execute-time
▪ Performed after the user fixes the problem size and the number of threads.
▪ AT is then applied when the library routine is called.
▪ All AT candidates are evaluated (brute-force search).
◦ Only 8+3+6+7*3 = 38 candidates.
• Number of repeats for each kernel in the AT mode
◦ 100 times
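
A brute-force search of this kind is simple enough to sketch in a few lines. The code below is a minimal illustration under stated assumptions, not the ppOpen-AT implementation: the function names are hypothetical, candidates are plain function pointers, and `clock()` measures CPU time, so a tuner for threaded kernels would more likely use a wall-clock timer such as `omp_get_wtime()`.

```c
#include <assert.h>
#include <time.h>

typedef void (*kernel_fn)(void);

/* Time `repeats` calls of one candidate, in seconds of CPU time. */
double time_candidate(kernel_fn f, int repeats)
{
    assert(f != 0 && repeats > 0);
    clock_t t0 = clock();
    for (int r = 0; r < repeats; r++)
        f();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Return the index of the smallest measured time (first one wins ties). */
int best_candidate(const double *seconds, int n)
{
    assert(seconds != 0 && n > 0);
    int best = 0;
    for (int c = 1; c < n; c++)
        if (seconds[c] < seconds[best])
            best = c;
    return best;
}

/* Brute-force search: run every candidate of one kernel and keep the best. */
int tune_kernel(const kernel_fn *cands, int n, int repeats)
{
    assert(cands != 0 && n > 0 && n <= 64);
    double seconds[64];
    for (int c = 0; c < n; c++)
        seconds[c] = time_candidate(cands[c], repeats);
    return best_candidate(seconds, n);
}

/* Search-space size quoted on the slide: 8 + 3 + 6 + 7*3. */
int total_candidates(void)
{
    return 8 + 3 + 6 + 7 * 3;
}
```

With only 38 candidates in total and 100 repeats per kernel, exhaustive evaluation stays cheap, so no search heuristic or performance model is needed for this experiment.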

The Xeon Phi Cluster System
• Intel Xeon (Ivy Bridge): host CPU
▪ OS: Red Hat Enterprise Linux Server release 6.2
▪ Number of nodes: 32 (available: 14 nodes)
▪ CPU: Intel Xeon E5-2670 V2 @ 2.50 GHz, 2 sockets x 10 cores
▪ Hyper-Threading: ON
▪ Theoretical peak performance for 1 node of CPU: 400 GFLOPS
▪ Memory size on 1 node: 64 GB
▪ Interconnect: InfiniBand
▪ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
▪ Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel
▪ KMP_AFFINITY=granularity=fine, compact (all threads are on a socket)
• Intel Xeon Phi co-processor (Xeon Phi): accelerator
▪ CPU: Xeon Phi 5110P (B1 stepping), 1.053 GHz, 60 cores
▪ Memory size: 8 GB
▪ Theoretical peak performance: 1 TFLOPS (= 1.053 GHz x 16 FLOPS x 60 cores)
▪ One board is connected to each node of the cluster
▪ Native mode
▪ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
▪ Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel -mmiccl -align array64byte
▪ KMP_AFFINITY=granularity=fine, balanced (all threads are equally distributed on the socket)

Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• Target problem size
◦ NX x NY x NZ = 256 x 96 x 100 / node
◦ NX x NY x NZ = 32 x 16 x 20 / core (not per MPI process)
• Native mode for MIC
• Target MPI processes and threads on the Xeon Phi
◦ 1 node of the Xeon Phi with 4 HT (Hyper-Threading)
◦ PXTY: X MPI processes and Y threads per process
◦ P240T1: pure MPI with 4 HT per core
◦ P120T2
◦ P60T4
◦ P16T15
◦ P8T30: the hybrid MPI-OpenMP execution with the fewest processes, since ppOpen-APPL/FDM needs at least 8 MPI processes.
• The number of iterations for the kernels: 100

AT Effect (update_stress, Xeon Phi), execution time in seconds
Conditions: KMP_AFFINITY=balanced, -align array64byte, new kernels

Configuration   Without AT [s]   With AT [s]   Speedup   Best SW
P240T1          2.11             1.29          1.63      6
P120T2          2.32             1.70          1.36      5
P60T4           2.33             1.74          1.34      5
P16T15          2.96             1.91          1.55      5
P8T30           3.14             1.97          1.59      6
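
The speedup row is simply the time without AT divided by the time with AT. The short sketch below recomputes it from the rounded chart values, as a sanity check on how the numbers relate. The array and function names are made up for this example, and because the inputs are already rounded to two decimals, the recomputed ratios can differ from the slide's figures in the last digit.

```c
#include <assert.h>

/* Chart values in seconds, ordered P240T1, P120T2, P60T4, P16T15, P8T30. */
static const double without_at[5] = {2.11, 2.32, 2.33, 2.96, 3.14};
static const double with_at[5]    = {1.29, 1.70, 1.74, 1.91, 1.97};

/* Speedup of one configuration: time without AT over time with AT. */
double speedup(int cfg)
{
    assert(cfg >= 0 && cfg < 5);
    return without_at[cfg] / with_at[cfg];
}

/* Index of the configuration with the shortest tuned execution time. */
int fastest_config_with_at(void)
{
    int best = 0;
    for (int c = 1; c < 5; c++)
        if (with_at[c] < with_at[best])
            best = c;
    return best;
}
```

On these numbers the pure-MPI configuration P240T1 is both the fastest after tuning and the one that gains the most from AT, while the hybrid configurations with few processes (P16T15, P8T30) remain slower even after a speedup of more than 1.5.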

Conclusion
• Loop fusion to obtain high parallelism is one of the key techniques for current multi-core and many-core architectures.
◦ Execution with 240 threads/MPI processes on the Xeon Phi.
◦ Strong scaling with more than 10,000 cores on the FX10.
• To apply AT on supercomputers in operation, minimizing the "software-stack" requirements is a practical way to establish AT.

ppOpen-AT is free software!
ppOpen-AT version 0.2 is available!
It is released under the MIT license.
Please access the following page:
http://ppopenhpc.cc.u-tokyo.ac.jp/

Thank you for your attention!
Questions?
