SlideShare a Scribd company logo
Automatic Generation of High-
Order Finite-Difference Code with
Temporal Blocking For Extreme-
Scale Many-Core Systems
ESPM2 2018
Nov.12th, Dallas
Hideyuki Tanaka*, Youhei Ishihara, Ryo Sakamoto,
Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori,
Miyuki Tsubouchi, Jun Makino
Abstract
 For an explicit finite-difference scheme applied to computation
fluid dynamics, we have achieved 4.78 PFlops, 21.5% efficiency
of peak performance on the large-scale PEZY-SC2 based system
which has very low B/F by temporal blocking
 The achieved efficiency is comparable to recent works on very
high B/F systems
 To achieve this high efficiency on a low B/F machine,
we developed
 A framework for explicit stencil computation which generates
the boilerplate code for MPI and device kernel code with
temporal blocking
 A finite-difference scheme suitable for temporal blocking
Table of Contents
 Introduction
 Explicit stencil computation
 Temporal blocking
 About PEZY-SC2
 Details of our work
 Code generation framework: Formura
 Optimization for PEZY-SC2
 Benchmark results
 Performance on large-scale systems (Gyoukou)
 Discussion and summary
Introduction
Explicit Stencil Computation
 Explicit stencil computation is simple but very important
application of HPC
 It is used for simulating weather, earthquake, inside of
the sun, etc.
 Optimizing stencil computation is very important
Source: Riken
Efficiency of Recent Stencil Computation
 Efficiency of explicit method on recent HPC
hardware is not high enough
 Even the best case efficiency on K computer is ≅ 20%,
other many cases are only ≅ 10% (not high enough)
 This low efficiency is caused by the problem in the
architecture of processors, or memory bandwidth
 We try to solve the problem of memory bandwidth
which does not depend on architecture
 In the past decades, B/F of HPC systems has been
reduced dramatically
 This trend seems likely to continue
Relative Performance Trend
 Green: FLOPS vs
memory bandwidth
(4.5x/decade)
 Red: FLOPS vs
network latency
(~30x/decade)
 This seems to
continue
Source: John D. McCalpin
PEZY-SC2
 Many core MIMD processor
 1984 individual RISC cores
 2.8TFlops peak DP performance (@700MHz)
 4ch DDR4 DRAM, 64GB, 80GB/s
 ⇒ B/F≒0.03
 cf. K computer: 0.5
Tesla V100: 0.12
TaihuLight: 0.04
PEZY-SC2 Architecture
 The chip consists of
8 prefectures
 Each prefecture
contains 16 cities
 Each city contains 16
processor elements
(PEs)
 8×16×16-64(redun-
dancy) = 1984PEs
 Each city shares L2
cache
Gyoukou
 Supercomputer installed at JAMSTEC, Japan
 (Available until April 2018)
 Peak 28.2PFlops (Full nodes)
 Top500 4th (Nov 2017)
 10000 PEZY-SC2s +
1250 Xeon D (1 for 8-SC2s)
 World’s largest numbers of
MIMD processor cores
(≒ 20M
cf. TaihuLight ≒ 11M)
 Suitable for the test to check
if the code can scale to exa-scale systems
Details of the work
Temporal Blocking (TB)
 One of the solution to explicit method on low B/F system
 With TB, multiple timesteps are calculated for working
array, so it can reduce required B/F when the working
array fits to the processor cache memory size
DRAM DRAM
Cache Cache Cache Cache
Computation
for one time step
Read from memory Write to memory
Network Network
・・・
Various Methods of TB
A variation of this used for
inter-node communication
This is used for
in-node computation
Detail of Our TB Calculation
 Inter-node communication
 Each node sends data to one-direction
 Each node receives data from one-direction
 Simple communication-computation overwrapping
Details of Our TB calculation
 Computation starts from the right-most block
 Upper-right of the parallelogram use dummy data to
equalize all loop lengths
 Gray part is unnecessary results
 This method increases few computation
SL4TH3 Scheme
 Fourth order accuracy
 Number of stencil = 2
 Flop per cell per step ≒ 2800
 Required B/F ≒ 0.05 w/o TB
Input Differential Equation
Formura: a Framework for TB
 From a description of a stencil written in formura DSL,
optimized distributed parallel codes for large scale
parallel computers are generated
 In this work, we add the support of the TB method for
this work, and developed a device kernel code generator
for PEZY-SC2
Formura
DSL
MPI driver
code
TB kernel
code
Executable
formura gcc/mpicc
formura
Code generation by Formura
Input Equation
(Needs some more configuration files)
(Very part of) Generated C Codes
・・・
Output (Zoomed-in)
Code Generation
 Formura generates:
 Driver code for TB distributing on MPI
 Optimized kernel codes for node-local computations
 For new accelerator (or any other processors), we can
add a backend by modifying the code that calculates
temporal blocking steps
 Typically, major optimizations for each device are
blocking layout for data access locality and thread
scheduling
Optimization for PEZY-SC2
 Decide the block size
 Size of block is smaller than LLC size
 Parallelism close to number of PEs for load-balancing
 44× 44× 44 is the best block size
 443×10(variables/cell)×8Byte = 6.50MB
 6.50MB×2 (for overwrapping read and write)<32MB=LLC size
 442 (parallelisms) = 1936≒ 1984(number of PEs)
 Total 880(=44×20)3 cells per node < 64GB
 Allocate adjacent cells in PEs which shares L2
 Decrease inner-most loop instruction size
 PEZY-SC2’s L1 I-Cache size = 4KB (= 1024ops)
 SL4TH3 requires 2800ops/cell
Results
Benchmark Results
 Conditions
 SL4TH3 scheme
 Optimized backend for PEZY-SC2
 8000 PEYZ-SC2s (20× 20× 20 layout) on Gyoukou
 ≒ 16M cores
 Total (880×20=)17600 3 cells
 Performance results
 4.78 PFlops
 21.5% efficiency (22.2PFlops theoretical-peak)
Effect of Temporal Blocking
NT Redundant
calculation
by TB
Required B/F
1 1.4% 0.058
2 2.7% 0.029
3 4.0% 0.020
4 5.4% 0.015
5 6.7% 0.012
6 8.0% 0.010
7 9.2% 0.009
8 10.5% 0.008
Required B/F by NT
(size per node = 8803) Calculation speed by NT
GF
NT
Calculation speed by NT
(= time step parameter)
Comparison with Other Studies
 This work achieves very high efficiency
 Comparable result with very high B/F (= 0.5) system
0
0.1
0.2
0.3
0.4
0.5
0.6
0
5
10
15
20
25
30
Yashiro et. al. Yang et. al. Hotta et. al. This work
B/F
Efficiency(%)
Efficiency
列 1
Device B/F
Weak Scaling
 The communication is
completely hidden by
computation
 Thus even though the
actual time for
communication increase
when we increase the
number of nodes, weak
scaling of the performance
is pretty good
Communication time
Total time
Weak Scaling
Future Works
 Other schemes
 HLLD
 Other application
 Tsunami (shallow-water equation)
 Reaction-diffusion system
 Further performance improvement
Conclusion
 We have achieved 4.78 PFlops, 21.5% efficiency of peak
performance on the fluid simulation code on the large-
scale PEZY-SC2 based system
 We developed an automatic code generation framework
for TB, a scheme suitable for it and a backend for PEZY-
SC2 accelerator
 Our achieved efficiency is comparable to other works on
high B/F systems

More Related Content

What's hot

Flexible dsp accelerator architecture exploiting carry save arithmetic
Flexible dsp accelerator architecture exploiting carry save arithmeticFlexible dsp accelerator architecture exploiting carry save arithmetic
Flexible dsp accelerator architecture exploiting carry save arithmetic
Ieee Xpert
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
Design Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless Application
Design Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless ApplicationDesign Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless Application
Design Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless Application
International Journal of Engineering Inventions www.ijeijournal.com
 
A high performance fir filter architecture for fixed and reconfigurable appli...
A high performance fir filter architecture for fixed and reconfigurable appli...A high performance fir filter architecture for fixed and reconfigurable appli...
A high performance fir filter architecture for fixed and reconfigurable appli...
Ieee Xpert
 
Graph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate designGraph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate design
Ieee Xpert
 
Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)elliando dias
 
Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)
IISRT
 
High performance pipelined architecture of elliptic curve scalar multiplicati...
High performance pipelined architecture of elliptic curve scalar multiplicati...High performance pipelined architecture of elliptic curve scalar multiplicati...
High performance pipelined architecture of elliptic curve scalar multiplicati...
Ieee Xpert
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
Lino Possamai
 
TinyML - 4 speech recognition
TinyML - 4 speech recognition TinyML - 4 speech recognition
TinyML - 4 speech recognition
艾鍗科技
 
Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principle
Marina Kolpakova
 
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGAHigh-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
JAYAPRAKASH JPINFOTECH
 
Ad4103173176
Ad4103173176Ad4103173176
Ad4103173176
IJERA Editor
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep Dive
Netronome
 
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR TCM DECODERS
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR  TCM DECODERSHIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR  TCM DECODERS
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR TCM DECODERSLalitha Gosukonda
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric Into
SciCompIIT
 
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Netronome
 
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISALec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Hsien-Hsin Sean Lee, Ph.D.
 
Tensorflow lite for microcontroller
Tensorflow lite for microcontrollerTensorflow lite for microcontroller
Tensorflow lite for microcontroller
Rouyun Pan
 

What's hot (20)

Flexible dsp accelerator architecture exploiting carry save arithmetic
Flexible dsp accelerator architecture exploiting carry save arithmeticFlexible dsp accelerator architecture exploiting carry save arithmetic
Flexible dsp accelerator architecture exploiting carry save arithmetic
 
B1030610
B1030610B1030610
B1030610
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Design Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless Application
Design Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless ApplicationDesign Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless Application
Design Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless Application
 
A high performance fir filter architecture for fixed and reconfigurable appli...
A high performance fir filter architecture for fixed and reconfigurable appli...A high performance fir filter architecture for fixed and reconfigurable appli...
A high performance fir filter architecture for fixed and reconfigurable appli...
 
Graph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate designGraph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate design
 
Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)
 
Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)
 
High performance pipelined architecture of elliptic curve scalar multiplicati...
High performance pipelined architecture of elliptic curve scalar multiplicati...High performance pipelined architecture of elliptic curve scalar multiplicati...
High performance pipelined architecture of elliptic curve scalar multiplicati...
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
 
TinyML - 4 speech recognition
TinyML - 4 speech recognition TinyML - 4 speech recognition
TinyML - 4 speech recognition
 
Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principle
 
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGAHigh-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
 
Ad4103173176
Ad4103173176Ad4103173176
Ad4103173176
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep Dive
 
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR TCM DECODERS
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR  TCM DECODERSHIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR  TCM DECODERS
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR TCM DECODERS
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric Into
 
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
 
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISALec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
 
Tensorflow lite for microcontroller
Tensorflow lite for microcontrollerTensorflow lite for microcontroller
Tensorflow lite for microcontroller
 

Similar to ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems

Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...
Joshua Mora
 
pMatlab on BlueGene
pMatlab on BlueGenepMatlab on BlueGene
pMatlab on BlueGenevsachde
 
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
AMD Developer Central
 
Nilesh ranpura systemmodelling
Nilesh ranpura systemmodellingNilesh ranpura systemmodelling
Nilesh ranpura systemmodellingObsidian Software
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
Ericsson
 
On the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsOn the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC Applications
Wim Vanderbauwhede
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
Jinwon Lee
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
LEGATO project
 
BURA Supercomputer
BURA SupercomputerBURA Supercomputer
BURA Supercomputer
SIMTEC Software and Services
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5Steen Larsen
 
Melp codec optimization using DSP kit
Melp codec optimization using DSP kitMelp codec optimization using DSP kit
Melp codec optimization using DSP kit
sohaibaslam207
 
Crypto Performance on ARM Cortex-M Processors
Crypto Performance on ARM Cortex-M ProcessorsCrypto Performance on ARM Cortex-M Processors
Crypto Performance on ARM Cortex-M Processors
Hannes Tschofenig
 
Intro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCIntro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPC
Slide_N
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
inside-BigData.com
 
7 eti pres
7 eti pres7 eti pres
7 eti pres
Raymond Kung
 
Threading Successes 06 Allegorithmic
Threading Successes 06   AllegorithmicThreading Successes 06   Allegorithmic
Threading Successes 06 Allegorithmicguest40fc7cd
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
Wilhelm van Belkum
 
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...
Codemotion
 

Similar to ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems (20)

Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...
 
pMatlab on BlueGene
pMatlab on BlueGenepMatlab on BlueGene
pMatlab on BlueGene
 
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
 
Nilesh ranpura systemmodelling
Nilesh ranpura systemmodellingNilesh ranpura systemmodelling
Nilesh ranpura systemmodelling
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
On the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsOn the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC Applications
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
BURA Supercomputer
BURA SupercomputerBURA Supercomputer
BURA Supercomputer
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5
 
Melp codec optimization using DSP kit
Melp codec optimization using DSP kitMelp codec optimization using DSP kit
Melp codec optimization using DSP kit
 
Crypto Performance on ARM Cortex-M Processors
Crypto Performance on ARM Cortex-M ProcessorsCrypto Performance on ARM Cortex-M Processors
Crypto Performance on ARM Cortex-M Processors
 
Intro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCIntro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPC
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
 
7 eti pres
7 eti pres7 eti pres
7 eti pres
 
Threading Successes 06 Allegorithmic
Threading Successes 06   AllegorithmicThreading Successes 06   Allegorithmic
Threading Successes 06 Allegorithmic
 
DSP_Assign_1
DSP_Assign_1DSP_Assign_1
DSP_Assign_1
 
Anegdotic Maxeler (Romania)
  Anegdotic Maxeler (Romania)  Anegdotic Maxeler (Romania)
Anegdotic Maxeler (Romania)
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...
 

More from Hideyuki Tanaka

Xpath in-lens
Xpath in-lensXpath in-lens
Xpath in-lens
Hideyuki Tanaka
 
IdrisでWebアプリを書く
IdrisでWebアプリを書くIdrisでWebアプリを書く
IdrisでWebアプリを書く
Hideyuki Tanaka
 
Yesod勉強会
Yesod勉強会Yesod勉強会
Yesod勉強会
Hideyuki Tanaka
 
C++コミュニティーの中心でC++をDISる
C++コミュニティーの中心でC++をDISるC++コミュニティーの中心でC++をDISる
C++コミュニティーの中心でC++をDISる
Hideyuki Tanaka
 
関数プログラミング入門
関数プログラミング入門関数プログラミング入門
関数プログラミング入門
Hideyuki Tanaka
 

More from Hideyuki Tanaka (8)

Xpath in-lens
Xpath in-lensXpath in-lens
Xpath in-lens
 
IdrisでWebアプリを書く
IdrisでWebアプリを書くIdrisでWebアプリを書く
IdrisでWebアプリを書く
 
手書きスライド
手書きスライド手書きスライド
手書きスライド
 
Monad tutorial
Monad tutorialMonad tutorial
Monad tutorial
 
Yesod勉強会
Yesod勉強会Yesod勉強会
Yesod勉強会
 
C++コミュニティーの中心でC++をDISる
C++コミュニティーの中心でC++をDISるC++コミュニティーの中心でC++をDISる
C++コミュニティーの中心でC++をDISる
 
関数プログラミング入門
関数プログラミング入門関数プログラミング入門
関数プログラミング入門
 
Icfp2009
Icfp2009Icfp2009
Icfp2009
 

Recently uploaded

Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
NoelManyise1
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
Areesha Ahmad
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
zeex60
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
ronaldlakony0
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
sonaliswain16
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 

Recently uploaded (20)

Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 

ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems

  • 1. Automatic Generation of High- Order Finite-Difference Code with Temporal Blocking For Extreme- Scale Many-Core Systems ESPM2 2018 Nov.12th, Dallas Hideyuki Tanaka*, Youhei Ishihara, Ryo Sakamoto, Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori, Miyuki Tsubouchi, Jun Makino
  • 2. Abstract  For an explicit finite-difference scheme applied to computation fluid dynamics, we have achieved 4.78 PFlops, 21.5% efficiency of peak performance on the large-scale PEZY-SC2 based system which has very low B/F by temporal blocking  The achieved efficiency is comparable to recent works on very high B/F systems  To achieve this high efficiency on a low B/F machine, we developed  A framework for explicit stencil computation which generates the boilerplate code for MPI and device kernel code with temporal blocking  A finite-difference scheme suitable for temporal blocking
  • 3. Table of Contents  Introduction  Explicit stencil computation  Temporal blocking  About PEZY-SC2  Details of our work  Code generation framework: Formura  Optimization for PEZY-SC2  Benchmark results  Performance on large-scale systems (Gyoukou)  Discussion and summary
  • 5. Explicit Stencil Computation  Explicit stencil computation is simple but very important application of HPC  It is used for simulating weather, earthquake, inside of the sun, etc.  Optimizing stencil computation is very important Source: Riken
  • 6. Efficiency of Recent Stencil Computation  Efficiency of explicit method on recent HPC hardware is not high enough  Even the best case efficiency on K computer is ≅ 20%, other many cases are only ≅ 10% (not high enough)  This low efficiency is caused by the problem in the architecture of processors, or memory bandwidth  We try to solve the problem of memory bandwidth which does not depend on architecture  In the past decades, B/F of HPC systems has been reduced dramatically  This trend seems likely to continue
  • 7. Relative Performance Trend  Green: FLOPS vs memory bandwidth (4.5x/decade)  Red: FLOPS vs network latency (~30x/decade)  This seems to continue Source: John D. McCalpin
  • 8. PEZY-SC2  Many core MIMD processor  1984 individual RISC cores  2.8TFlops peak DP performance (@700MHz)  4ch DDR4 DRAM, 64GB, 80GB/s  ⇒ B/F≒0.03  cf. K computer: 0.5 Tesla V100: 0.12 TaihuLight: 0.04
  • 9. PEZY-SC2 Architecture  The chip consists of 8 prefectures  Each prefecture contains 16 cities  Each city contains 16 processor elements (PEs)  8×16×16-64(redun- dancy) = 1984PEs  Each city shares L2 cache
  • 10. Gyoukou  Supercomputer installed at JAMSTEC, Japan  (Available until April 2018)  Peak 28.2PFlops (Full nodes)  Top500 4th (Nov 2017)  10000 PEZY-SC2s + 1250 Xeon D (1 for 8-SC2s)  World’s largest numbers of MIMD processor cores (≒ 20M cf. TaihuLight ≒ 11M)  Suitable for the test to check if the code can scale to exa-scale systems
  • 12. Temporal Blocking (TB)  One of the solution to explicit method on low B/F system  With TB, multiple timesteps are calculated for working array, so it can reduce required B/F when the working array fits to the processor cache memory size DRAM DRAM Cache Cache Cache Cache Computation for one time step Read from memory Write to memory Network Network ・・・
  • 13. Various Methods of TB A variation of this used for inter-node communication This is used for in-node computation
  • 14. Detail of Our TB Calculation  Inter-node communication  Each node sends data to one-direction  Each node receives data from one-direction  Simple communication-computation overwrapping
  • 15. Details of Our TB calculation  Computation starts from the right-most block  Upper-right of the parallelogram use dummy data to equalize all loop lengths  Gray part is unnecessary results  This method increases few computation
  • 16. SL4TH3 Scheme  Fourth order accuracy  Number of stencil = 2  Flop per cell per step ≒ 2800  Required B/F ≒ 0.05 w/o TB
  • 18. Formura: a Framework for TB  From a description of a stencil written in formura DSL, optimized distributed parallel codes for large scale parallel computers are generated  In this work, we add the support of the TB method for this work, and developed a device kernel code generator for PEZY-SC2 Formura DSL MPI driver code TB kernel code Executable formura gcc/mpicc formura
  • 19. Code generation by Formura Input Equation (Needs some more configuration files) (Very part of) Generated C Codes ・・・ Output (Zoomed-in)
  • 20. Code Generation  Formura generates:  Driver code for TB distributing on MPI  Optimized kernel codes for node-local computations  For new accelerator (or any other processors), we can add a backend by modifying the code that calculates temporal blocking steps  Typically, major optimizations for each device are blocking layout for data access locality and thread scheduling
  • 21. Optimization for PEZY-SC2  Decide the block size  Size of block is smaller than LLC size  Parallelism close to number of PEs for load-balancing  44× 44× 44 is the best block size  443×10(variables/cell)×8Byte = 6.50MB  6.50MB×2 (for overwrapping read and write)<32MB=LLC size  442 (parallelisms) = 1936≒ 1984(number of PEs)  Total 880(=44×20)3 cells per node < 64GB  Allocate adjacent cells in PEs which shares L2  Decrease inner-most loop instruction size  PEZY-SC2’s L1 I-Cache size = 4KB (= 1024ops)  SL4TH3 requires 2800ops/cell
  • 23. Benchmark Results  Conditions  SL4TH3 scheme  Optimized backend for PEZY-SC2  8000 PEYZ-SC2s (20× 20× 20 layout) on Gyoukou  ≒ 16M cores  Total (880×20=)17600 3 cells  Performance results  4.78 PFlops  21.5% efficiency (22.2PFlops theoretical-peak)
  • 24. Effect of Temporal Blocking NT Redundant calculation by TB Required B/F 1 1.4% 0.058 2 2.7% 0.029 3 4.0% 0.020 4 5.4% 0.015 5 6.7% 0.012 6 8.0% 0.010 7 9.2% 0.009 8 10.5% 0.008 Required B/F by NT (size per node = 8803) Calculation speed by NT GF NT Calculation speed by NT (= time step parameter)
  • 25. Comparison with Other Studies  This work achieves very high efficiency  Comparable result with very high B/F (= 0.5) system 0 0.1 0.2 0.3 0.4 0.5 0.6 0 5 10 15 20 25 30 Yashiro et. al. Yang et. al. Hotta et. al. This work B/F Efficiency(%) Efficiency 列 1 Device B/F
  • 26. Weak Scaling  The communication is completely hidden by computation  Thus even though the actual time for communication increase when we increase the number of nodes, weak scaling of the performance is pretty good Communication time Total time
  • 28. Future Works  Other schemes  HLLD  Other application  Tsunami (shallow-water equation)  Reaction-diffusion system  Further performance improvement
  • 29. Conclusion  We have achieved 4.78 PFlops, 21.5% efficiency of peak performance on the fluid simulation code on the large- scale PEZY-SC2 based system  We developed an automatic code generation framework for TB, a scheme suitable for it and a backend for PEZY- SC2 accelerator  Our achieved efficiency is comparable to other works on high B/F systems