Porting and Optimization of Numerical Libraries for ARM SVE

State of Scalasca
Itaru Kitayama
RIKEN AICS

What’s Scalasca?
•  Toolset for performance analysis of parallel applications (MPI + OpenMP)
•  Open source, 3-clause BSD license
•  Portable implementation
•  IBM Blue Gene, Cray XT/XE/XK/XC, SGI Altix, Fujitsu FX10/100, K computer, Linux (x86, Power, ARM), Intel Xeon Phi
•  Depends on Score-P instrumenter & measurement libraries
•  Supports common data formats
•  Reads event traces in OTF2 format
•  Writes analysis reports in CUBE4 format

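Scalasca workflow
[Figure: Scalasca workflow diagram. Score-P part: source modules pass through the instrumenter (compiler / linker) to give an instrumented executable; the instrumented target application runs with the measurement library (with HWC) and produces a summary report or local event traces; the summary report feeds an optimized measurement configuration. Scalasca trace analysis part: a parallel wait-state search over the local event traces produces a wait-state report. Report manipulation then answers: Which problem? Where in the program? Which process?]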

Scalasca Status
•  Scalasca can be used for parallel application performance studies on arm64
•  GNU Autotools are updated to recognize the arm64 architecture
•  Latest stable version is 2.3.1 (May 2016)
•  Cube v4.4 is upcoming
•  Sampling mode for arm64 is in progress
•  Bug fixes and enhancements are to follow

Sampling Mode
•  Important for avoiding excessive overhead due to instrumentation
•  Requires the libunwind package
•  Interrupt sources: POSIX timer, perf, PAPI
•  Works on x86; work is ongoing on arm64
•  Issue: passing a PLT-entry resolved address to libunwind does not work as expected
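
The choice of interrupt source can be pictured as a small piece of measurement configuration. The sketch below is only illustrative: it assumes the Score-P environment variables SCOREP_ENABLE_UNWINDING and SCOREP_SAMPLING_EVENTS and the perf_cycles@10000000 event string as I understand them from the Score-P documentation (check them against the installed version), and the helper name sampling_environment is hypothetical.

```python
import os


def sampling_environment(event="perf_cycles", period=10_000_000, base=None):
    """Return a copy of the environment with sampling switched on.

    Assumption: Score-P reads SCOREP_ENABLE_UNWINDING and
    SCOREP_SAMPLING_EVENTS ("<event>@<period>"); verify these names
    against the documentation of the installed Score-P version.
    """
    if period <= 0:
        raise ValueError("sampling period must be positive")
    env = dict(os.environ if base is None else base)
    env["SCOREP_ENABLE_UNWINDING"] = "true"            # needs libunwind support
    env["SCOREP_SAMPLING_EVENTS"] = f"{event}@{period}"  # interrupt source
    return env


if __name__ == "__main__":
    env = sampling_environment()
    for name in ("SCOREP_ENABLE_UNWINDING", "SCOREP_SAMPLING_EVENTS"):
        print(f"{name}={env[name]}")
```

The returned dictionary could be passed as the env argument of subprocess.run when launching the measurement; on arm64 this remains subject to the libunwind issue listed above.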

libunwind test results on arm64
============================================================================
Testsuite summary for libunwind 1.3-rc1
============================================================================
# TOTAL: 35
# PASS: 26
# SKIP: 0
# XFAIL: 0
# FAIL: 9
# XPASS: 0
# ERROR: 0
============================================================================
See tests/test-suite.log
Please report to libunwind-devel@nongnu.org
============================================================================
The summary above is the output of "$ make check" on arm64. As of 1.3-rc1, AArch64 is listed as “Works well”. Test environment:
•  kernel: 4.14
•  gcc: 4.8.5 20150623 (Red Hat 4.8.5-16)
•  glibc: 2.17
•  hardware: Cavium ThunderX
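
To put the testsuite summary in numbers, the following minimal sketch parses the "# KEY: N" lines and computes the pass rate. The counts are copied from the summary above; the function names parse_testsuite_summary and pass_rate are hypothetical helpers, not part of libunwind or Automake.

```python
import re

# Counts copied from the libunwind 1.3-rc1 testsuite summary on arm64.
SUMMARY = """\
# TOTAL: 35
# PASS: 26
# SKIP: 0
# XFAIL: 0
# FAIL: 9
# XPASS: 0
# ERROR: 0
"""


def parse_testsuite_summary(text):
    """Map each result category (TOTAL, PASS, ...) to its integer count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"#\s*([A-Z]+):\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def pass_rate(counts):
    """Fraction of tests that passed."""
    return counts["PASS"] / counts["TOTAL"]


if __name__ == "__main__":
    counts = parse_testsuite_summary(SUMMARY)
    # The individual categories should add up to TOTAL.
    parts = sum(v for k, v in counts.items() if k != "TOTAL")
    print("consistent:", parts == counts["TOTAL"])
    print(f"pass rate: {pass_rate(counts):.1%}")
```

With 26 of 35 tests passing, roughly a quarter of the testsuite still fails on this platform.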

Cube Status
•  Release v4.4 is upcoming
•  Major changes since stable v4.3:
   •  Packaging
   •  Many plugins for customized performance analysis
      •  KNL vectorization adviser
      •  OTF2 Trace visualizer
      •  Sunburst
      •  ScorePion
   •  Memory footprint reduction (to appear in v4.4 or after)
•  http://www.scalasca.org/software/cube-4.x/download.html

Snapshot of Cube GUI on ThunderX
[Figure: screenshot of the Cube GUI running on ThunderX]

NPB3.3-MZ-MPI/BT Exercise on ThunderX
•  NAS Parallel Benchmarks suite (sample MZ-MPI version)
•  Available from http://www.nas.nasa.gov/Software/NPB
•  3 benchmarks (all in Fortran77, using OpenMP+MPI)
•  Configurable for various sizes & classes

NPB-MZ-MPI/BT (Block Tridiagonal Solver)
•  What does it do?
•  Solves a discretized version of the unsteady, compressible Navier-Stokes equations in three spatial dimensions
•  Performs 200 time steps on a regular 3-dimensional grid using ADI and verifies that the solution error is within acceptable limits
•  Intra-zone computation with OpenMP, inter-zone communication with MPI
•  Implemented in about 20 Fortran77 source modules
•  Runs with any number of MPI processes & OpenMP threads
•  On ThunderX, bt-mz_B.16 x6 (16 processes, 6 threads each) should run in about 30 seconds
•  CLASS=B is recommended

NPB-MZ-MPI/BT profile execution
•  Set OMP_NUM_THREADS and launch as an MPI application
-bash-4.2$ scan -s mpiexec -np 16 ./bt-mz.B.16
S=C=A=N: Scalasca 2.3.1 runtime summarization
S=C=A=N: ./scorep_bt-mz_16x6_sum experiment archive
S=C=A=N: Sat Dec 2 12:30:05 2017: Collect start
/home/itaru/opt/openmpi-2.1.1/bin/mpiexec -np 16 ./bt-mz.B.16

NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark

Number of zones: 8 x 8
Iterations: 200 dt: 0.000300
Number of active processes: 16

Use the default load factors with threads
Total number of threads: 96 ( 6.0 threads/process)

Calculated speedup = 93.84

Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class B
accuracy setting for epsilon = 0.1000000000000E-07
[…]
S=C=A=N: Sat Dec 2 12:30:44 2017: Collect done (status=0) 39s
S=C=A=N: ./scorep_bt-mz_16x6_sum complete.
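
The launch step shown in the log can be sketched as a small driver. This is a minimal illustration assuming the same command line as above (scan -s mpiexec -np 16 ./bt-mz.B.16) and 6 OpenMP threads per process; the helper names build_scan_command and build_environment are hypothetical, and the measurement only starts if scan and the benchmark binary are actually present.

```python
import os
import shutil
import subprocess


def build_scan_command(executable, ranks, launcher="mpiexec"):
    """Assemble the 'scan -s <launcher> -np <ranks> <executable>' command."""
    return ["scan", "-s", launcher, "-np", str(ranks), executable]


def build_environment(threads_per_rank, base=None):
    """Copy the environment and set OMP_NUM_THREADS for each MPI process."""
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = str(threads_per_rank)
    return env


if __name__ == "__main__":
    ranks, threads = 16, 6
    cmd = build_scan_command("./bt-mz.B.16", ranks)
    env = build_environment(threads)
    print("command:", " ".join(cmd))
    print("total threads:", ranks * threads)  # 16 x 6, as reported in the log
    # Only run the measurement when the tools are available on this machine.
    if shutil.which("scan") and os.path.exists("./bt-mz.B.16"):
        subprocess.run(cmd, env=env, check=True)
```

The 16 x 6 layout matches the "Total number of threads: 96" line and the 16x6 part of the experiment archive name in the log.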

NPB-MZ-MPI/BT build configuration definition
# F77 - Fortran compiler
# FFLAGS - Fortran compilation arguments
# F_INC - any -I arguments required for compiling Fortran
# FLINK - Fortran linker
# FLINKFLAGS - Fortran linker arguments
# F_LIB - any -L and -l arguments required for linking Fortran
#
# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
# $(F77) $(FFLAGS)
# linking is done with $(FLINK) $(F_LIB) $(FLINKFLAGS)
#---------------------------------------------------------------------------
#---------------------------------------------------------------------------
# This is the fortran compiler used for fortran programs
#---------------------------------------------------------------------------
F77 = scorep mpif77
#F77 = mpif77
•  File: config/make.def
•  The Score-P wrapper (scorep) goes just before the compiler

Cube Data Representation
[Figure: Cube display of the experiment, showing 16 ranks with 6 threads each]

Summary
•  Scalasca and Score-P have been ported to arm64, and the tools work on real hardware
•  The missing feature is sampling
•  The data visualization and analysis framework will be updated

Thanks to
•  Markus Geimer (JSC)
•  Pavel Saviankou (JSC)
•  Brian Wylie (JSC)
•  Michael Knobloch (JSC)
•  Scalasca and Score-P Communities
