
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems

Slides from the ESPM2 2018 presentation.


  1. Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking for Extreme-Scale Many-Core Systems
     ESPM2 2018, Nov. 12th, Dallas
     Hideyuki Tanaka*, Youhei Ishihara, Ryo Sakamoto, Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori, Miyuki Tsubouchi, Jun Makino
  2. Abstract  For an explicit finite-difference scheme applied to computational fluid dynamics, we achieved 4.78 PFlops, 21.5% of peak performance, on a large-scale PEZY-SC2 based system with a very low B/F ratio, by using temporal blocking  The achieved efficiency is comparable to recent works on systems with much higher B/F  To achieve this high efficiency on a low-B/F machine, we developed  A framework for explicit stencil computation that generates the MPI boilerplate code and device kernel code with temporal blocking  A finite-difference scheme suitable for temporal blocking
  3. Table of Contents  Introduction  Explicit stencil computation  Temporal blocking  About PEZY-SC2  Details of our work  Code generation framework: Formura  Optimization for PEZY-SC2  Benchmark results  Performance on large-scale systems (Gyoukou)  Discussion and summary
  4. Introduction
  5. Explicit Stencil Computation  Explicit stencil computation is a simple but very important class of HPC applications  It is used to simulate weather, earthquakes, the interior of the sun, etc.  Optimizing stencil computation is therefore very important Source: RIKEN
  6. Efficiency of Recent Stencil Computation  The efficiency of explicit methods on recent HPC hardware is not high enough  Even the best-case efficiency on the K computer is ≈ 20%; many other cases reach only ≈ 10%  This low efficiency is caused by processor-architecture limits or by memory bandwidth  We tackle the memory-bandwidth problem, which does not depend on the architecture  Over the past decades, the B/F ratio of HPC systems has dropped dramatically  This trend seems likely to continue
  7. Relative Performance Trend  Green: FLOPS vs. memory bandwidth (4.5x/decade)  Red: FLOPS vs. network latency (~30x/decade)  These trends seem likely to continue Source: John D. McCalpin
  8. PEZY-SC2  Many-core MIMD processor  1984 individual RISC cores  2.8 TFlops peak DP performance (@700 MHz)  4ch DDR4 DRAM, 64 GB, 80 GB/s ⇒ B/F ≈ 0.03  cf. K computer: 0.5, Tesla V100: 0.12, TaihuLight: 0.04
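As a quick sanity check of that ratio from the numbers on this slide:

    B/F = 80 GB/s ÷ 2800 GFlops ≈ 0.029 ≈ 0.03

so the chip must perform roughly 35 flops for every byte it can stream from DRAM, an order of magnitude more than the K computer's 0.5 B/F.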
  9. PEZY-SC2 Architecture  The chip consists of 8 prefectures  Each prefecture contains 16 cities  Each city contains 16 processor elements (PEs)  8 × 16 × 16 − 64 (redundancy) = 1984 PEs  The PEs in a city share an L2 cache
  10. Gyoukou  Supercomputer installed at JAMSTEC, Japan  (Available until April 2018)  Peak 28.2 PFlops (full nodes)  Top500 4th (Nov 2017)  10000 PEZY-SC2s + 1250 Xeon Ds (1 Xeon D per 8 SC2s)  World's largest number of MIMD processor cores (≈ 20M; cf. TaihuLight ≈ 11M)  Suitable for testing whether the code can scale to exascale systems
  11. Details of the Work
  12. Temporal Blocking (TB)  One of the solutions for explicit methods on low-B/F systems  With TB, multiple timesteps are computed on a working array; this reduces the required B/F whenever the working array fits in the processor's cache [Diagram: per-timestep DRAM reads/writes and network exchanges without TB vs. multi-step in-cache computation with TB]
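A minimal 1-D sketch of the idea (illustrative only; the sizes, names, and the stencil are assumptions, not the paper's generated code). A cache-sized block plus an NT-wide halo is read from DRAM once, advanced NT timesteps while it stays in cache, and written back once, so DRAM traffic per timestep drops by roughly a factor of NT:

    /* Advance one interior block by NT fused timesteps.
     * Domain-boundary handling is omitted for brevity. */
    #define N     (1 << 20)   /* total cells in DRAM       */
    #define BLOCK 4096        /* cells that fit in cache   */
    #define NT    8           /* timesteps fused per block */

    static double u[N];
    static double buf[2][BLOCK + 2 * NT];  /* cache-resident double buffer */

    void advance_block(int base) {
        for (int i = 0; i < BLOCK + 2 * NT; i++)          /* one DRAM read  */
            buf[0][i] = u[base - NT + i];
        for (int t = 0; t < NT; t++)                      /* NT in-cache sweeps;   */
            for (int i = t + 1; i < BLOCK + 2 * NT - t - 1; i++) /* the valid      */
                buf[(t + 1) & 1][i] =                     /* region shrinks by one */
                    0.25 * (buf[t & 1][i - 1]             /* cell per step         */
                          + 2.0 * buf[t & 1][i]
                          + buf[t & 1][i + 1]);
        for (int i = 0; i < BLOCK; i++)                   /* one DRAM write */
            u[base + i] = buf[NT & 1][NT + i];
    }

Neighboring blocks re-load and recompute a 2·NT-cell halo; this redundant computation is the few-percent overhead quantified on slide 24.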
  13. Various Methods of TB [Figure: diagrams of TB variants]  A variation of one of these methods is used for inter-node communication  Another is used for in-node computation
  14. Details of Our TB Calculation  Inter-node communication  Each node sends data in one direction only  Each node receives data from one direction only  Simple communication-computation overlapping
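A hedged sketch of such a one-directional halo exchange with overlap (the MPI calls are standard, but the buffer names, tag, and the two compute stubs are illustrative assumptions, not the paper's driver code):

    #include <mpi.h>

    void compute_interior(void);   /* TB steps on cells needing no remote data */
    void compute_boundary(void);   /* cells that depend on the received halo   */

    void exchange_and_compute(double *halo_send, double *halo_recv,
                              int count, MPI_Comm comm) {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);
        int right = (rank + 1) % nprocs;           /* send toward one side   */
        int left  = (rank + nprocs - 1) % nprocs;  /* receive from the other */
        MPI_Request req[2];

        MPI_Irecv(halo_recv, count, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Isend(halo_send, count, MPI_DOUBLE, right, 0, comm, &req[1]);
        compute_interior();        /* overlap: compute while messages fly */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        compute_boundary();
    }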
  15. Details of Our TB Calculation  Computation starts from the right-most block  The upper-right part of the parallelogram uses dummy data to equalize all loop lengths  The gray part holds results that are not needed  This method adds only a small amount of extra computation
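In 1-D terms, a hypothetical skeleton of this traversal (names and sizes are assumptions; only the loop structure is shown): blocks are swept from the right, and the update window slides one cell left per fused timestep, so its right edge always reads data that the previously processed block has already produced. Dummy columns keep every loop at the same trip count; cells along the trailing slant read dummy or stale data and produce the discarded "gray" results, which the neighboring block recomputes validly at the cost of the small redundancy noted above:

    enum { NCELL = 1024, BLK = 64, NT = 8, OFF = NT };

    /* OFF dummy columns on the left and one on the right let every loop
     * run exactly BLK iterations with no edge branches. */
    static double buf[NT + 1][OFF + NCELL + 1];

    void sweep(void) {
        for (int b = NCELL / BLK - 1; b >= 0; b--) {   /* right-most block first */
            for (int t = 0; t < NT; t++) {
                int lo = b * BLK - t;                  /* window slides left...  */
                for (int i = lo; i < lo + BLK; i++)    /* ...at constant length  */
                    buf[t + 1][OFF + i] =
                        0.25 * (buf[t][OFF + i - 1]
                              + 2.0 * buf[t][OFF + i]
                              +       buf[t][OFF + i + 1]);
            }
        }
    }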
  16. SL4TH3 Scheme  Fourth-order accuracy  Stencil width: 2  Flops per cell per step ≈ 2800  Required B/F ≈ 0.05 without TB
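That figure is consistent with a rough traffic estimate, assuming each of the 10 variables per cell (the count quoted on slide 21) is read and written once per sweep:

    10 variables × 8 bytes × 2 (read + write) = 160 bytes per cell per step
    160 bytes ÷ 2800 flops ≈ 0.057 B/F

which matches the ≈ 0.058 in the NT = 1 row of the table on slide 24.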
  17. Input Differential Equation
  18. Formura: a Framework for TB  From a description of a stencil written in the Formura DSL, optimized distributed-parallel code for large-scale parallel computers is generated  In this work, we added support for the TB method and developed a device kernel code generator for PEZY-SC2  Pipeline: Formura DSL → (formura) → MPI driver code + TB kernel code → (gcc/mpicc) → executable
  19. Code Generation by Formura [Figure: an input equation (plus some configuration files) and a zoomed-in excerpt of the generated C code]
  20. Code Generation  Formura generates:  Driver code that distributes TB over MPI  Optimized kernel code for node-local computation  For a new accelerator (or any other processor), a backend can be added by modifying the code that calculates the temporal-blocking steps  Typically, the major optimizations for each device are block layout for data-access locality and thread scheduling
  21. Optimization for PEZY-SC2  Deciding the block size  The block must be smaller than the LLC  Parallelism should be close to the number of PEs for load balancing  44 × 44 × 44 is the best block size  44³ × 10 (variables/cell) × 8 bytes = 6.50 MB  6.50 MB × 2 (for overlapping read and write) < 32 MB = LLC size  44² (parallelism) = 1936 ≈ 1984 (number of PEs)  Total (44 × 20)³ = 880³ cells per node < 64 GB  Allocate adjacent cells to PEs that share an L2 cache  Keep the innermost loop's instruction footprint small  PEZY-SC2's L1 I-cache size = 4 KB (= 1024 ops)  SL4TH3 requires ≈ 2800 ops/cell
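These constraints are simple enough to check mechanically; a small stand-alone program (constants copied from this slide, not part of Formura) reproduces the arithmetic:

    #include <stdio.h>

    int main(void) {
        const double vars = 10.0, bytes = 8.0;
        const int b = 44;
        double block_mb = (double)b * b * b * vars * bytes / (1 << 20);
        printf("block: %.2f MB; double-buffered: %.2f MB (< 32 MB LLC)\n",
               block_mb, 2.0 * block_mb);                /* 6.50 MB, 13.00 MB */
        printf("parallelism: %d vs. 1984 PEs\n", b * b); /* 44^2 = 1936       */
        double node_gb = 880.0 * 880.0 * 880.0 * vars * bytes / (1 << 30);
        printf("per-node arrays: %.1f GB (< 64 GB DRAM)\n", node_gb); /* ~50.8 */
        return 0;
    }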
  22. Results
  23. Benchmark Results  Conditions  SL4TH3 scheme  Optimized backend for PEZY-SC2  8000 PEZY-SC2s (20 × 20 × 20 layout) on Gyoukou  ≈ 16M cores  Total (880 × 20)³ = 17600³ cells  Performance results  4.78 PFlops  21.5% efficiency (22.2 PFlops theoretical peak)
  24. Effect of Temporal Blocking (NT = time-step parameter; size per node = 880³)

    NT | Redundant calculation by TB | Required B/F
     1 |  1.4%                       | 0.058
     2 |  2.7%                       | 0.029
     3 |  4.0%                       | 0.020
     4 |  5.4%                       | 0.015
     5 |  6.7%                       | 0.012
     6 |  8.0%                       | 0.010
     7 |  9.2%                       | 0.009
     8 | 10.5%                       | 0.008

    [Figures: required B/F by NT; calculation speed (GFlops) by NT]
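The B/F column falls roughly as 0.058/NT, since each DRAM round-trip now serves NT timesteps:

    0.058 / 2 ≈ 0.029    0.058 / 4 ≈ 0.015    0.058 / 8 ≈ 0.007

and the small deviations at large NT are consistent with the growing redundant computation in the second column.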
  25. Comparison with Other Studies  This work achieves very high efficiency  Comparable to results on a very high-B/F (= 0.5) system [Figure: efficiency (%) vs. device B/F for Yashiro et al., Yang et al., Hotta et al., and this work]
  26. Weak Scaling  The communication is completely hidden by the computation  Thus, even though the actual communication time increases with the number of nodes, the weak scaling of the performance is very good [Figure: communication time vs. total time]
  27. Weak Scaling
  28. Future Work  Other schemes  HLLD  Other applications  Tsunami (shallow-water equations)  Reaction-diffusion systems  Further performance improvement
  29. Conclusion  We achieved 4.78 PFlops, 21.5% of peak performance, with our fluid-simulation code on the large-scale PEZY-SC2 based system  We developed an automatic code-generation framework for TB, a finite-difference scheme suitable for it, and a backend for the PEZY-SC2 accelerator  The achieved efficiency is comparable to other works on high-B/F systems
