Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling

Center for Computational Sciences, Univ. of Tsukuba
Taisuke Boku Norihisa Fujita Ryohei Kobayashi Osamu Tatebe
{taisuke,fujita}@ccs.tsukuba.ac.jp {kobayashi,tatebe}@cs.tsukuba.ac.jp
Cygnus – World First Multihybrid Accelerated Cluster with GPU and FPGA Coupling
2022/08/29 DUAC2022

Accelerators in HPC ⇒ majority = GPU
• Is the GPU perfect?
  − good for many applications (replacing vector machines)
  − depends on very wide and regular parallelism
    − large-scale SIMD (STMD) mechanism in a chip
    − high-bandwidth memory (HBM, HBM2) and local memory
  − insufficient for cases with...
    − not enough parallelism
    − irregular computation (warp divergence)
    − frequent inter-node communication (kernel switch, going back to the CPU)
[Figure: NVIDIA Tesla A100 Tensor Core GPU (from NVIDIA web page)]

FPGA in HPC
• Strengths of recent FPGAs for HPC
  − True co-design with applications (essential)
  − Programmability improvement: OpenCL, other high-level languages
  − High-performance interconnect: 100Gbps x N
  − Precision control is possible
  − Relatively low power
• Problems
  − Programmability: OpenCL is not enough, not efficient
  − Low standard FLOPS: still cannot catch up to GPU
    -> “never try what GPU works well on”
  − Memory bandwidth: one generation older than high-end CPU/GPU
    -> being improved by HBM (Stratix10)
[Figure: BittWare 520N with Intel Stratix10 FPGA, equipped with 4x 100Gbps optical interconnection interfaces]

GPU vs FPGA as HPC solutions
device               | GPU                           | FPGA
parallelization      | SIMD (x multi-group)          | pipeline (x multi-group)
standard FLOPS       | 😃😃 (1000x cores)             | 😃 (~100x pipeline)
conditional branch   | 😢 (warp divergence)          | 😃 (both direction)
memory               | 😃😃 (HBM2e)                   | 😢 (DDR) → 😃 (HBM2)
interconnect         | 😢 (via host facility)        | 😃😃 (own optical links)
programming          | 😃 (CUDA, OpenACC, OpenMP)    | 😢 (HDL) → 😏 (HLS)
self-controllability | 😢 (slave device of host CPU) | 😃 (autonomic)
HPC applications     | 😃 (various fields)           | 😢 (not much)

CHARM: Cooperative Heterogeneous Acceleration with Reconfigurable Multi-devices
[Figure: CHARM concept. Each node has a CPU, a GPU (comp.), and an FPGA (comp. and comm.) connected via PCIe. The CPU invokes GPU/FPGA kernels; data transfer via PCIe is invoked from the FPGA. The nodes form a basic cluster with GPUs (by InfiniBand), and the FPGAs form an FPGA network of 100Gbps direct optical links for application-oriented FPGA-FPGA communication.]
• Cooperative computing with GPU and FPGA for complicated multi-physics/multi-scale problems

Cygnus: world first multi-hybrid cluster with GPU+FPGA
Cygnus supercomputer at Center for Computational Sciences, Univ. of Tsukuba (Apr. 2019~)
85 nodes in total including 32 “Albireo” nodes with GPU+FPGA (other “Deneb” nodes have GPU only)
@ CCS, Univ. of Tsukuba (deployed by NEC)

Single node configuration (Albireo)
[Figure: single node (with FPGA). Two CPUs, each with a PCIe network (switch) that connects two GPUs, one FPGA, and two HCAs. Each FPGA attaches to the inter-FPGA direct network (100Gbps x4); the two HCAs on each side attach to the network switch (100Gbps x2).]
• Each node is equipped with both IB EDR and the FPGA-direct network
• Some nodes are equipped with both FPGAs and GPUs; the other nodes have GPUs only

Two types of interconnection network
[Figure: Deneb and Albireo computation nodes all attached to the InfiniBand HDR100/200 network, which carries parallel processing communication and shared file system access from all nodes; the FPGAs of the Albireo nodes additionally form the inter-FPGA direct network.]
IB HDR100/200 Network (100Gbps x4/node)
All computation nodes (Albireo and Deneb) are connected by a full-bisection Fat Tree network with 4 channels of InfiniBand HDR100 (combined into HDR200 at the switch). It carries parallel processing communication such as MPI and is also used to access the Lustre shared file system.
Inter-FPGA direct network (only for Albireo nodes)
Inter-FPGA torus network: the 64 FPGAs on the Albireo nodes (2 FPGAs/node) are connected by an 8x8 2D torus network without a switch.
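The 8x8 2D torus described above can be sketched in a few lines of Python. This is only a toy model: the row-major numbering of the 64 FPGAs and the helper names `torus_neighbors` and `torus_hops` are assumptions made for illustration, since the slides do not say how the FPGAs are numbered or how packets are routed.

```python
# Toy model of an 8x8 2D torus of FPGAs.
# ASSUMPTION: FPGAs are numbered 0..63 in row-major order (not stated in the slides).
SIZE = 8  # 8x8 torus = 64 FPGAs (32 Albireo nodes x 2 FPGAs each)


def torus_neighbors(fpga_id, size=SIZE):
    """Return the IDs of the four direct neighbors: north, south, west, east."""
    row, col = divmod(fpga_id, size)
    return [
        ((row - 1) % size) * size + col,  # north, wraps from row 0 to row 7
        ((row + 1) % size) * size + col,  # south
        row * size + (col - 1) % size,    # west, wraps from column 0 to column 7
        row * size + (col + 1) % size,    # east
    ]


def torus_hops(src, dst, size=SIZE):
    """Minimum number of hops between two FPGAs, using wraparound links."""
    r1, c1 = divmod(src, size)
    r2, c2 = divmod(dst, size)
    dr = abs(r1 - r2)
    dc = abs(c1 - c2)
    # In each dimension, take the shorter way around the ring.
    return min(dr, size - dr) + min(dc, size - dc)
```

Each FPGA has exactly four neighbors, which matches the four 100Gbps optical ports per board. In this model the two opposite corners (IDs 0 and 63) are only 2 hops apart thanks to the wraparound links, and no pair of FPGAs is more than 8 hops apart.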

[Figure: Albireo node with two CPUs, four GPUs, and two FPGAs. InfiniBand: IB HDR100 x4 ⇨ HDR200 x2, connected to the IB HDR200 switch (for full-bisection Fat-Tree). FPGA optical network: 100Gbps x4, x2 FPGAs. Total: 1.2Tbps/node.]
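The 1.2Tbps/node figure for an Albireo node appears to be the sum of the InfiniBand and FPGA optical links. The short calculation below checks that reading; treating the total as a plain sum of nominal link rates is an assumption, and the constant and function names are illustrative only.

```python
# Nominal link counts and rates for one Albireo node, taken from the slide.
IB_LINKS = 4        # InfiniBand HDR100 channels per node
FPGAS_PER_NODE = 2  # FPGA boards per Albireo node
FPGA_LINKS = 4      # optical links per FPGA board
LINK_GBPS = 100     # nominal rate of every link


def node_bandwidth_gbps():
    """Return (InfiniBand, FPGA optical, total) nominal bandwidth in Gbps."""
    ib = IB_LINKS * LINK_GBPS
    fpga = FPGAS_PER_NODE * FPGA_LINKS * LINK_GBPS
    # ASSUMPTION: the per-node total is the plain sum of both networks.
    return ib, fpga, ib + fpga
```

Under that assumption, InfiniBand contributes 400 Gbps and the FPGA optical network 800 Gbps, for 1200 Gbps in total.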

Research to support CHARM model on Cygnus
• FPGA-network: CIRCUS (Communication Integrated Reconfigurable CompUting System)
  − direct interconnect facility among FPGA boards by multi-dimensional optical links (~100Gbps) with router and OpenCL-ready API
  − pipelining all computation and communication seamlessly
• GPU-FPGA DMA: kicked by FPGA (without CPU)
  − PCIe-protocol-based DMA engine to reduce the overhead of high-speed data transfer between multiple devices
• Programming:
  − Intel oneAPI
    ⇒ task-by-task assignment of computation parts to GPU and FPGA under DPC++ device queue management
• Application: ARGOT, an astrophysics application for early-universe object generation
  − two main parts are executed by GPU and FPGA
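The idea of assigning computation parts task by task to per-device queues can be illustrated with a toy Python model. This is not DPC++ code and not the authors' implementation: the two single-worker executors merely stand in for a GPU queue and an FPGA queue, and `argot_step`, `art_step`, and `run_timestep` are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def argot_step(mesh):
    """Hypothetical stand-in for the part assigned to the GPU."""
    return sum(mesh)


def art_step(mesh):
    """Hypothetical stand-in for the part assigned to the FPGA."""
    return max(mesh)


def run_timestep(mesh):
    """Submit one task to each 'device queue' and wait for both results."""
    # One single-worker executor per device models an in-order device queue.
    with ThreadPoolExecutor(max_workers=1) as gpu_queue, \
         ThreadPoolExecutor(max_workers=1) as fpga_queue:
        gpu_task = gpu_queue.submit(argot_step, mesh)
        fpga_task = fpga_queue.submit(art_step, mesh)
        return gpu_task.result(), fpga_task.result()
```

Calling `run_timestep([1, 2, 3])` returns one result per device. Whether the two parts actually overlap in time in the real code is not stated in the slides.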

CIRCUS
• Intel FPGA SDK for OpenCL
  − We can describe FPGA hardware in OpenCL
• Problem: How to write inter-FPGA communication code in OpenCL?
  − MPI is the standard method for HPC applications
  − It is memory-to-memory communication, which is not suitable for FPGAs
  − We need to utilize pipeline-based communication in an FPGA
• → CIRCUS: Communication Integrated Reconfigurable CompUting System
  − Pipelined communication and computation
  − communicate from or to a computation pipeline directly
sender code on FPGA1:
  // simple_out: output channel, declared elsewhere (declaration not shown on the slide)
  __kernel void sender(__global float* restrict x, int n)
  {
    for (int i = 0; i < n; i++) {
      float v = x[i];                      // read one element from global memory
      write_channel_intel(simple_out, v);  // stream it out through the channel
    }
  }
receiver code on FPGA2:
  // simple_in: input channel, declared elsewhere (declaration not shown on the slide)
  __kernel void receiver(__global float* restrict x, int n)
  {
    for (int i = 0; i < n; i++) {
      float v = read_channel_intel(simple_in);  // take one element from the channel
      x[i] = v;                                 // store it to global memory
    }
  }
[Figure: the sender and receiver kernels are connected through the Comm. Backend]
* N. Fujita, et al., Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA, AsHES2020.
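The sender/receiver pair above streams data element by element instead of copying a whole buffer memory to memory. A minimal Python sketch of that pattern follows, with a bounded `queue.Queue` standing in for the FPGA channel and two threads standing in for the two FPGAs. It models only the programming pattern, not the hardware; `run_pipeline` and the `depth` parameter are illustrative assumptions.

```python
import queue
import threading


def sender(x, channel):
    """Push each element into the channel, like write_channel_intel()."""
    for v in x:
        channel.put(v)  # blocks while the bounded channel is full


def receiver(n, channel, out):
    """Pull n elements from the channel, like read_channel_intel()."""
    for i in range(n):
        out[i] = channel.get()  # blocks until an element is available


def run_pipeline(data, depth=4):
    """Stream data from sender to receiver through a bounded FIFO."""
    # ASSUMPTION: a small FIFO depth models the channel; the real depth is not given.
    channel = queue.Queue(maxsize=depth)
    out = [None] * len(data)
    t_send = threading.Thread(target=sender, args=(data, channel))
    t_recv = threading.Thread(target=receiver, args=(len(data), channel, out))
    t_send.start()
    t_recv.start()
    t_send.join()
    t_recv.join()
    return out
```

Because the FIFO is much smaller than the data, the receiver has to start consuming while the sender is still producing. That overlap is the point of pipelined communication, in contrast to waiting for a complete buffer to arrive.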

CIRCUS performance
[Figure: two plots, latency (1 hop to 7 hops) and throughput (1 hop to 7 hops)]
• min. latency: 500 ns
• additional latency per hop: ~250 ns
• max. throughput: 90.2Gbps
Evaluated on up to 8 Bittware 520N FPGA boards in Cygnus supercomputer at CCS, University of Tsukuba
N. Fujita, et al., Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA, AsHES2020.
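The reported numbers suggest a simple linear latency model, sketched below. This is an estimate, not measured data: it assumes the 500 ns minimum corresponds to a single hop and that each further hop adds exactly 250 ns, neither of which the slide states precisely. The function names are illustrative.

```python
MIN_LATENCY_NS = 500        # reported minimum latency (ASSUMED to be the 1-hop case)
PER_HOP_NS = 250            # reported additional latency per hop (approximate)
LINK_GBPS = 100             # nominal rate of one optical link
MAX_THROUGHPUT_GBPS = 90.2  # reported maximum throughput


def estimated_latency_ns(hops):
    """Linear estimate of latency for a path of `hops` hops (hops >= 1)."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return MIN_LATENCY_NS + PER_HOP_NS * (hops - 1)


def link_efficiency():
    """Fraction of the nominal link rate reached at maximum throughput."""
    return MAX_THROUGHPUT_GBPS / LINK_GBPS
```

Under these assumptions, a 7-hop path would take about 2 µs, and the maximum throughput corresponds to roughly 90% of the nominal 100Gbps link rate.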

What does CIRCUS provide?
• CIRCUS: Communication Integrated Reconfigurable CompUting System
  − Goal 1: provide a High Level Synthesis programming environment for parallel FPGA systems with FPGA-FPGA communication links
  − Goal 2: combine the computation pipeline and the communication pipeline seamlessly to fully exploit the strengths of FPGA computation and communication

CHARM by oneAPI
• In oneAPI, programming in DPC++ is recommended
  − (a) approach
  − Problem: existing GPU and FPGA code is written in other languages such as CUDA and OpenCL, and these code assets already exist
  − Reimplementation in DPC++ is a burden for users
• oneAPI can also use modules written in other languages
  − (b) approach
  − Existing code can be reused

Application Example – ARGOT code
• ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree)
  − Simulator for the early-stage universe, where the first stars and galaxies were born
  − Radiative transfer code developed at the Center for Computational Sciences (CCS), University of Tsukuba
  − CPU (OpenMP) and GPU (CUDA) implementations are available
  − Inter-node parallelism is also supported using MPI
• ART (Authentic Radiation Transfer) method
  − It solves radiative transfer from light sources spreading out in space
  − Dominant computation part (over 90%) of the ARGOT program
• We accelerate the ART method on an FPGA using Intel FPGA SDK for OpenCL as an HLS environment (with oneAPI)

Cosmic Radiative Transfer Simulation
ARGOT
[Figure: two kinds of radiation, Point Source and Diffuse Photon]
Two computation elements in ARGOT code: ARGOT method and ART method
• ARGOT method: Point Source processing
• ART method (Authentic Radiation Transfer): Diffuse Photon processing

Cosmic Radiative Transfer Simulation
ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree) code
[Figure: mapping of the two schemes onto GPUs and FPGAs under CHARM]
• Point Source: ARGOT scheme for radiative transfer (RT) from point sources → GPU acceleration
• Diffuse Photon: ART scheme for RT from spatially spread-out matter → FPGA acceleration

Performance evaluation (bare CUDA+OpenCL vs with oneAPI)
• problem size of 32^3
• Single node (1 GPU + 1 FPGA)
• ART
  − ART on GPU is slow
  − The FPGA can accelerate it in a pipelined manner
• oneAPI vs CUDA+OpenCL
  − The execution time with oneAPI increases by 1.5%
    => almost no overhead
[Figure: execution time comparison, lower is better]
R. Kashino, et al., Multi-hetero Acceleration by GPU and FPGA for Astrophysics Simulation on oneAPI Environment, HPC Asia 2022, Jan. 2022.

ARGOT with 2 nodes (2 GPUs/IB + 2 FPGAs/CIRCUS)
• Weak scaling, 32x32x32 mesh for each node
[Figure: stacked bar chart, lower is better. Vertical axis: execution time [s]. Horizontal axis: # of nodes / total mesh size. Legend: ARGOT init, ARGOT comp., Ray segment assignment, Optical depth accumulation, ART init, ART comp., ART comm., Others.]
# of nodes / total mesh size | GPU-only | GPU + FPGA
1 Node / (32, 32, 32)        | 0.89 s   | 0.13 s
2 Nodes / (64, 32, 32)       | 2.05 s   | 0.16 s
・1 node performance
 ・GPU+FPGA : GPU-only = 6.8x higher
・2 nodes performance
 ・GPU+FPGA : GPU-only = 12.8x higher
ART method part:
GPU-GPU MPI comm. is so heavy
→ large overhead from copying multiple small chunks of data
FPGA-FPGA MPI comm. by CIRCUS
→ very effective
 ・low latency & high bandwidth
 ・comp. + comm. pipelining
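The quoted speedups follow directly from the four execution times on the chart, as the short check below shows. It assumes that the values 0.89, 0.13, 2.05, and 0.16 belong to the bars in left-to-right order (GPU-only and GPU + FPGA for 1 node, then for 2 nodes); the dictionary layout and function names are illustrative.

```python
# Execution times in seconds read from the chart.
# ASSUMPTION: the four values map to the bars in left-to-right order.
TIMES = {
    1: {"gpu": 0.89, "gpu_fpga": 0.13},  # 1 node, (32, 32, 32) mesh
    2: {"gpu": 2.05, "gpu_fpga": 0.16},  # 2 nodes, (64, 32, 32) mesh
}


def speedup(nodes):
    """Speedup of GPU + FPGA over GPU-only for the given node count."""
    return TIMES[nodes]["gpu"] / TIMES[nodes]["gpu_fpga"]


def weak_scaling_efficiency(config):
    """Ratio of 1-node to 2-node time; 1.0 would be perfect weak scaling."""
    return TIMES[1][config] / TIMES[2][config]
```

The ratios come out at about 6.8 for 1 node and 12.8 for 2 nodes, in line with the slide. The same numbers give a weak-scaling efficiency of roughly 0.43 for GPU-only against 0.81 for GPU + FPGA, which is consistent with the remark that GPU-GPU MPI communication dominates the ART part.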

Summary
• Toward the exascale era, homogeneous or single-accelerator systems will be limited in application variety and scalability
• CCS, U. Tsukuba, is running a multi-hetero supercomputer named Cygnus, built with GPU+FPGA under the CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multi-devices) concept
• Several supporting systems for GPU-FPGA cooperation have been developed, including a language solution, toward high sustained performance of multi-physics simulations
• FPGA for HPC is a new concept toward the next generation's flexible and low-power solution beyond GPU-only computing
• Multi-physics simulation is the first-stage target of Cygnus; it will be expanded to a variety of applications where a GPU-only solution has bottlenecks
• The current FPGA-side implementation is based on bare OpenCL, and we need to expand to other languages and other run-time systems
• Call me if you want to use Cygnus with us!
