
Porting and Optimization of Numerical Libraries for ARM SVE



By Toshiyuki Imamura, RIKEN AICS

RIKEN and Fujitsu are developing ARM-based numerical libraries optimized using the new ARM SVE features. We present the porting status of netlib+SSL-II for ARM SVE and other OSS, and demonstrate some optimization policies and techniques, especially for the basic numerical linear algebra kernels.

Toshiyuki Imamura Bio
Toshiyuki Imamura is currently the team leader of the Large-scale Parallel Numerical Computing Technology team at the RIKEN Advanced Institute for Computational Science (AICS). He is in charge of the development of numerical libraries for the post-K project. His research interests include high-performance computing, automatic-tuning technology, and eigenvalue computation (algorithms/software/applications). He and his colleagues on the Japan Atomic Energy Agency (JAEA) team were finalists for the Gordon Bell Prize at SC05 and SC06. He is a member of IPSJ, JSIAM, and SIAM.

Email
imamura.toshiyuki@riken.jp

For more info on the Linaro High Performance Computing (HPC) group, visit https://www.linaro.org/sig/hpc/


  1. State of Scalasca (Itaru Kitayama, RIKEN AICS)
  2. What's Scalasca
     • Parallel application (MPI + OpenMP) performance study toolset
     • Open source, 3-clause BSD license
     • Portable implementation: IBM Blue Gene, Cray XT/XE/XK/XC, SGI Altix, Fujitsu FX10/FX100, K computer, Linux (x86, Power, ARM), Intel Xeon Phi
     • Depends on the Score-P instrumenter & measurement libraries
     • Supports common data formats: reads event traces in OTF2 format, writes analysis reports in CUBE4 format
  3. Score-P: Scalasca trace analysis
     [Workflow diagram: source modules pass through the Score-P instrumenter (compiler/linker) to produce an instrumented executable; measurement with the measurement library (and hardware counters) yields either a summary report from an optimized measurement configuration, or local event traces that feed a parallel wait-state search producing a wait-state report; report manipulation then answers: Which problem? Where in the program? Which process?]
  4. Scalasca Status
     • Scalasca can be used for parallel application performance studies on arm64
     • GNU Autotools are updated to recognize the arm64 architecture
     • Latest stable version is 2.3.1 (May 2016)
     • Cube v4.4 is upcoming
     • Sampling mode for arm64 is being worked on
     • Bug fixes and enhancements will be coming
  5. Sampling Mode
     • Important to avoid excessive overhead due to instrumentation
     • Requires the libunwind package
     • POSIX timer, perf, and PAPI are the sources of interrupts
     • Works on x86; the work is ongoing on arm64
     • Issue: a PLT-entry resolved address passed to libunwind does not work as expected
  6. libunwind test results on arm64
     As of 1.3-rc1, AArch64 "works well". $ make check on arm64 produces:

     ============================================================================
     Testsuite summary for libunwind 1.3-rc1
     ============================================================================
     # TOTAL: 35
     # PASS:  26
     # SKIP:  0
     # XFAIL: 0
     # FAIL:  9
     # XPASS: 0
     # ERROR: 0
     ============================================================================
     See tests/test-suite.log
     Please report to libunwind-devel@nongnu.org
     ============================================================================

     Test environment:
     • kernel: 4.14
     • gcc: 4.8.5 20150623 (Red Hat 4.8.5-16)
     • glibc: 2.17
     • hardware: Cavium ThunderX
  7. Cube Status
     • Release v4.4 is upcoming
     • Major changes since stable v4.3:
       • Packaging
       • Many plugins for customized performance analysis: KNL vectorization adviser, OTF2 trace visualizer, Sunburst, ScorePion
       • Memory footprint reduction (to appear in v4.4 or after)
     • http://www.scalasca.org/software/cube-4.x/download.html
  8. Snapshot of Cube GUI on ThunderX
  9. NPB3.3-MZ-MPI/BT Exercise on ThunderX
     • NAS Parallel Benchmarks suite (sample MZ-MPI version)
     • Available from http://www.nas.nasa.gov/Software/NPB
     • 3 benchmarks (all in Fortran77, using OpenMP+MPI)
     • Configurable for various sizes & classes
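The fetch-and-build steps implied by this slide can be sketched as follows. This is a hedged sketch: the tarball name `NPB3.3.1.tar.gz`, the directory layout, and the `make bt-mz CLASS=B NPROCS=16` target follow the standard NPB-MZ distribution and are assumptions, not taken from the slide.

```shell
# Sketch: obtain and build the BT-MZ benchmark (assumed file/target names).
#
#   wget http://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz
#   tar xzf NPB3.3.1.tar.gz && cd NPB3.3.1/NPB3.3-MZ-MPI
#   cp config/make.def.template config/make.def   # set F77, FLINK here
#   make bt-mz CLASS=B NPROCS=16                  # class B, 16 MPI ranks
#
# The build drops the binary in bin/, named after the class and rank count:
binary="bt-mz.B.16"
echo "expected binary: bin/$binary"
```

The class and rank count are baked into the binary name at build time, which is why the later slides launch `./bt-mz.B.16` specifically.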
  10. NPB-MZ-MPI/BT (Block Tridiagonal Solver)
      • What does it do?
        • Solves a discretized version of the unsteady, compressible Navier-Stokes equations in three spatial dimensions
        • Performs 200 time steps on a regular 3-dimensional grid using ADI and verifies that the solution error is within an acceptable limit
      • Intra-zone computation with OpenMP, inter-zone with MPI
      • Implemented in 20 or so Fortran77 source modules
      • Runs with any number of MPI processes & OpenMP threads
      • On ThunderX, bt-mz.B.16 with 6 threads per process should run in about 30 seconds
      • CLASS=B is recommended
  11. NPB-MZ-MPI/BT profile execution
      • Set OMP_NUM_THREADS and launch as an MPI application:

      -bash-4.2$ scan -s mpiexec -np 16 ./bt-mz.B.16
      S=C=A=N: Scalasca 2.3.1 runtime summarization
      S=C=A=N: ./scorep_bt-mz_16x6_sum experiment archive
      S=C=A=N: Sat Dec 2 12:30:05 2017: Collect start
      /home/itaru/opt/openmpi-2.1.1/bin/mpiexec -np 16 ./bt-mz.B.16

      NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark
      Number of zones: 8 x 8
      Iterations: 200  dt: 0.000300
      Number of active processes: 16
      Use the default load factors with threads
      Total number of threads: 96 ( 6.0 threads/process)
      Calculated speedup = 93.84
      Time step 1
      Time step 20
      Time step 40
      Time step 60
      Time step 80
      Time step 100
      Time step 120
      Time step 140
      Time step 160
      Time step 180
      Time step 200
      Verification being performed for class B
      accuracy setting for epsilon = 0.1000000000000E-07
      [...]
      S=C=A=N: Sat Dec 2 12:30:44 2017: Collect done (status=0) 39s
      S=C=A=N: ./scorep_bt-mz_16x6_sum complete.
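As a script, the profiled run above can be sketched like this, assuming Score-P's `scan` and Open MPI are on `PATH`; `SCOREP_EXPERIMENT_DIRECTORY` is an optional Score-P variable used here to pin the archive name shown in the log.

```shell
# Hybrid configuration matching the log: 16 MPI ranks x 6 OpenMP threads.
export OMP_NUM_THREADS=6
# Optionally name the experiment archive Score-P will create:
export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_16x6_sum
total_threads=$((16 * OMP_NUM_THREADS))   # 96, as reported by the benchmark
# 'scan -s' wraps the MPI launcher and enables Scalasca runtime summarization:
#   scan -s mpiexec -np 16 ./bt-mz.B.16
echo "$total_threads threads total, archive: $SCOREP_EXPERIMENT_DIRECTORY"
```

Because `scan` only wraps the launcher, the same line works with any MPI starter; the summary report lands in the experiment archive directory for later inspection with Cube.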
  12. NPB-MZ-MPI/BT build configuration definition
      • config/make.def (the Score-P wrapper goes just before the compiler!):

      # F77        - Fortran compiler
      # FFLAGS     - Fortran compilation arguments
      # F_INC      - any -I arguments required for compiling Fortran
      # FLINK      - Fortran linker
      # FLINKFLAGS - Fortran linker arguments
      # F_LIB      - any -L and -l arguments required for linking Fortran
      #
      # compilations are done with $(F77) $(F_INC) $(FFLAGS) or
      # $(F77) $(FFLAGS)
      # linking is done with $(FLINK) $(F_LIB) $(FLINKFLAGS)
      #---------------------------------------------------------------------------
      # This is the Fortran compiler used for Fortran programs
      #---------------------------------------------------------------------------
      F77 = scorep mpif77
      #F77 = mpif77
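To illustrate what the `F77 = scorep mpif77` line changes, the expanded compile and link commands can be sketched as below; `hello.f`, `hello.o`, and `FFLAGS=-O2` are hypothetical placeholders, and `FLINK="$F77"` mirrors the common NPB default of `FLINK = $(F77)`.

```shell
# With F77 set to "scorep mpif77", every $(F77) invocation is prefixed by the
# Score-P wrapper, which instruments the code and then calls mpif77 as usual.
F77="scorep mpif77"
FFLAGS="-O2"                              # hypothetical compile flags
FLINK="$F77"                              # NPB commonly sets FLINK = $(F77)
compile_cmd="$F77 $FFLAGS -c hello.f"     # expands $(F77) $(FFLAGS)
link_cmd="$FLINK $FFLAGS -o hello hello.o"
echo "compile: $compile_cmd"
echo "link:    $link_cmd"
```

No source changes are needed: swapping the one make.def variable is enough to rebuild the whole benchmark instrumented.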
  13. Cube Data Representation
      [Screenshot: Cube GUI view of the measurement, 16 ranks with 6 threads each]
  14. Summary
      • Scalasca and Score-P have been ported to arm64, and the tools are working fine on real hardware
      • The missing feature is sampling
      • The data visualization and analysis framework will be updated
  15. Thanks to
      • Markus Geimer (JSC)
      • Pavel Saviankou (JSC)
      • Brian Wylie (JSC)
      • Michael Knobloch (JSC)
      • The Scalasca and Score-P communities
