Center for Computational Sciences, Univ. of Tsukuba
Taisuke Boku Norihisa Fujita Ryohei Kobayashi Osamu Tatebe
Center for Computational Sciences
University of Tsukuba
{taisuke,fujita}@ccs.tsukuba.ac.jp {kobayashi,tatebe}@cs.tsukuba.ac.jp
Cygnus – World First Multihybrid Accelerated Cluster
with GPU and FPGA Coupling
2022/08/29 DUAC2022
Accelerators in HPC ⇒ majority = GPU
• Is the GPU perfect?
− good for many applications (replacing vector machines)
− depends on very wide and regular parallelism
  − large scale SIMD (STMD) mechanism in a chip
  − high bandwidth memory (HBM, HBM2) and local memory
− insufficient for cases with...
  − not enough parallelism
  − irregular computation (warp divergence)
  − frequent inter-node communication (kernel switch, go back to CPU)
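The warp-divergence point can be made concrete with a toy cost model. The sketch below assumes a lockstep SIMD group that has to execute every branch path taken by at least one of its lanes, so a divergent if/else costs the sum of both paths. The function name `simd_group_cost` and the cycle counts are hypothetical, chosen only for illustration; real GPU timing depends on many more factors.

```python
def simd_group_cost(lane_conditions, then_cycles, else_cycles):
    """Toy cycle count for one lockstep SIMD group (e.g. a warp) running an if/else.

    Assumption: the group serially executes each branch path that at least
    one lane takes, masking off the lanes that did not take it.
    """
    cost = 0
    if any(lane_conditions):          # some lane takes the "then" path
        cost += then_cycles
    if not all(lane_conditions):      # some lane takes the "else" path
        cost += else_cycles
    return cost


# Hypothetical 32-lane group with a 100-cycle "then" and an 80-cycle "else" path.
uniform = simd_group_cost([True] * 32, 100, 80)              # no divergence
divergent = simd_group_cost([True] + [False] * 31, 100, 80)  # one lane differs
print(uniform, divergent)
```

In this model a single deviating lane is enough to make the whole group pay for both paths, which is the case the slide lists as a weak point of GPUs.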
[Figure: NVIDIA Tesla A100 Tensor Core GPU (from NVIDIA web page)]
FPGA in HPC
• Strengths of recent FPGAs for HPC
− True codesigning with applications (essential)
− Programmability improvement: OpenCL, other high level languages
− High performance interconnect: 100Gbps x N
− Precision control is possible
− Relatively low power
• Problems
− Programmability: OpenCL is not enough, not efficient
− Low standard FLOPS: still cannot catch up to GPU
  -> “never try what GPU works well on”
− Memory bandwidth: one generation behind high-end CPU/GPU
  -> to be improved by HBM (Stratix10)
[Figure: BittWare 520N with Intel Stratix10 FPGA, equipped with 4x 100Gbps optical interconnection interfaces]
GPU vs FPGA as HPC solutions
device | GPU | FPGA
parallelization | SIMD (x multi-group) | pipeline (x multi-group)
standard FLOPS | 😃😃 (1000x cores) | 😃 (~100x pipeline)
conditional branch | 😢 (warp divergence) | 😃 (both directions)
memory | 😃😃 (HBM2e) | 😢 (DDR)→😃 (HBM2)
interconnect | 😢 (via host facility) | 😃😃 (own optical links)
programming | 😃 (CUDA, OpenACC, OpenMP) | 😢 (HDL)→😏 (HLS)
self-controllability | 😢 (slave device of host CPU) | 😃 (autonomous)
HPC applications | 😃 (various fields) | 😢 (not much)
CHARM: Cooperative Heterogeneous Acceleration with
Reconfigurable Multi-devices
[Figure: two nodes, each with CPU, GPU and FPGA connected by PCIe, showing computation (comp.) and communication (comm.) paths. The CPU invokes GPU/FPGA kernels; GPU-FPGA data transfer goes via PCIe (invoked from the FPGA). The nodes form a basic cluster with GPUs (by InfiniBand) plus an FPGA network of 100Gbps direct optical links for application-oriented FPGA-FPGA communication.]
• Target: complicated multi-physics/multi-scale problems
• Cooperative computing with GPU and FPGA
Cygnus: world first multi-hybrid cluster with GPU+FPGA
Cygnus supercomputer at Center for Computational Sciences, Univ. of Tsukuba (Apr. 2019~)
85 nodes in total including 32 “Albireo” nodes with GPU+FPGA (other “Deneb” nodes have GPU only)
@ CCS, Univ. of Tsukuba (deployed by NEC)
Single node configuration (Albireo)
[Figure: single node (with FPGA). Two CPUs, each attached through a PCIe network (switch) to two GPUs, one FPGA and two HCAs. Each FPGA connects to the inter-FPGA direct network (100Gbps x4); each pair of HCAs connects to the network switch (100Gbps x2).]
• Each node is equipped with both IB EDR and FPGA-direct network
• Some nodes are equipped with both FPGAs and GPUs, and other nodes have GPUs only
Two types of interconnection network
[Figure: all computation nodes (Deneb and Albireo) attached to the IB HDR100/200 network (100Gbps x4/node); the FPGAs of the Albireo nodes are additionally linked by the inter-FPGA direct network]
• InfiniBand HDR100/200 network for parallel processing communication and shared file system access from all nodes
− All computation nodes (Albireo and Deneb) are connected by a full-bisection Fat Tree network with 4 channels of InfiniBand HDR100 (combined to HDR200 switch) for parallel processing communication such as MPI; it is also used to access the Lustre shared file system
• Inter-FPGA direct network (only for Albireo nodes)
− Inter-FPGA torus network: the 64 FPGAs on Albireo nodes (2 FPGAs/node) are connected by an 8x8 2D torus network without a switch
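To illustrate what an 8x8 2D torus means for connectivity, here is a minimal sketch that lists the direct neighbors of an FPGA and counts hops with wraparound links. The (x, y) coordinate scheme, the function names and the shortest-path assumption are hypothetical; the slides do not state how FPGAs are numbered or how routes are chosen.

```python
def torus_neighbors(x, y, size=8):
    """Return the four direct neighbors of node (x, y) on a size x size 2D torus."""
    return [
        ((x - 1) % size, y),
        ((x + 1) % size, y),
        (x, (y - 1) % size),
        (x, (y + 1) % size),
    ]


def torus_hops(a, b, size=8):
    """Minimum hop count between nodes a and b, using wraparound links."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, size - dx) + min(dy, size - dy)


print(torus_neighbors(0, 0))        # wraparound gives (7, 0) and (0, 7)
print(torus_hops((0, 0), (7, 7)))   # opposite corner is close thanks to wraparound
print(max(torus_hops((0, 0), (x, y)) for x in range(8) for y in range(8)))
```

Each FPGA has four neighbors, which matches the four optical ports per board, and under these assumptions no pair of FPGAs is more than 8 hops apart.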
[Figure: Albireo node with two CPUs, four GPUs and two FPGAs. IB HDR100 x4 ⇨ HDR200 x2 to the IB HDR200 switch (for full-bisection Fat-Tree); FPGA optical network 100Gbps x4, x2 FPGAs. Total: 1.2Tbps/node.]
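The 1.2Tbps/node figure can be reproduced from the link counts shown for the Albireo node. The breakdown below is one plausible reading of the diagram (four InfiniBand HDR100 links plus four 100Gbps optical links on each of the two FPGAs), not a statement from the authors; the variable names are illustrative.

```python
LINK_GBPS = 100                 # per-link rate, from the slide

ib_links_per_node = 4           # IB HDR100 x4
fpgas_per_node = 2              # Albireo node
optical_links_per_fpga = 4      # 100Gbps x4 per FPGA

ib_gbps = ib_links_per_node * LINK_GBPS
fpga_gbps = fpgas_per_node * optical_links_per_fpga * LINK_GBPS
total_tbps = (ib_gbps + fpga_gbps) / 1000

print(ib_gbps, fpga_gbps, total_tbps)
```

Under this reading, two thirds of a node's external bandwidth belongs to the inter-FPGA network rather than to InfiniBand.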
Research to support CHARM model on Cygnus
• FPGA network: CIRCUS (Communication Integrated Reconfigurable CompUting System)
− direct interconnect facility among FPGA boards by multi-dimensional optical links (~100Gbps) with router and OpenCL-ready API
− pipelining all computation and communication seamlessly
• GPU-FPGA DMA: kicked by FPGA (without CPU)
− PCIe-protocol-based DMA engine for high-speed data transfer between devices
• Programming: Intel oneAPI
  ⇒ task-by-task assignment of computation parts to GPU and FPGA under DPC++ device queue management
• Application: ARGOT, an astrophysics application for early-universe object generation
− two main parts are executed by GPU and FPGA
CIRCUS
• Intel FPGA SDK for OpenCL
− We can describe FPGA hardware in OpenCL
• Problem: How to write inter-FPGA communication code in OpenCL?
− MPI is the standard method for HPC applications
− It is memory-to-memory communication, not suitable for FPGAs
− We need to utilize pipeline-based communication in an FPGA
• → CIRCUS: Communication Integrated Reconfigurable CompUting System
− Pipelined communication and computation
− communicate from or to a computation pipeline directly
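sender code on FPGA1:

#pragma OPENCL EXTENSION cl_intel_channels : enable
// Assumed declaration; the slide does not show how simple_out is bound to the CIRCUS backend.
channel float simple_out;

__kernel void sender(__global float* restrict x, int n)
{
    for (int i = 0; i < n; i++) {
        float v = x[i];
        write_channel_intel(simple_out, v);  // stream v into the communication backend
    }
}

receiver code on FPGA2:

#pragma OPENCL EXTENSION cl_intel_channels : enable
// Assumed declaration, as above.
channel float simple_in;

__kernel void receiver(__global float* restrict x, int n)
{
    for (int i = 0; i < n; i++) {
        float v = read_channel_intel(simple_in);  // take the next value from the backend
        x[i] = v;
    }
}

[Figure: the sender on FPGA1 and the receiver on FPGA2 are connected through the communication backend]

* N. Fujita, et al., Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA, AsHES2020.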
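The sender/receiver pattern above relies on blocking FIFO semantics: a write waits while the channel is full and a read waits while it is empty. The following is a host-side Python model of that behavior only, using a bounded `queue.Queue` in place of the channel and threads in place of the two FPGAs. It is not FPGA code and not the CIRCUS API; the FIFO depth of 4 and all names are hypothetical.

```python
import queue
import threading


def sender(channel, data):
    """Push each element into the channel; put() blocks while the FIFO is full."""
    for v in data:
        channel.put(v)


def receiver(channel, out, n):
    """Pop n elements from the channel; get() blocks while the FIFO is empty."""
    for _ in range(n):
        out.append(channel.get())


data = [float(i) for i in range(16)]
received = []
channel = queue.Queue(maxsize=4)   # small depth, chosen arbitrarily

t_recv = threading.Thread(target=receiver, args=(channel, received, len(data)))
t_send = threading.Thread(target=sender, args=(channel, data))
t_recv.start()
t_send.start()
t_send.join()
t_recv.join()

print(received == data)
```

Because both sides block on the FIFO, neither needs an explicit handshake or a memory-to-memory copy, which is the property the slide contrasts with MPI-style communication.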
CIRCUS performance
[Figures: latency (1 hop to 7 hops) and throughput (1 hop to 7 hops)]
• min. latency: 500ns
• additional latency per hop: ~250 ns
• max. throughput: 90.2Gbps
Evaluated on up to 8 Bittware 520N FPGA boards in Cygnus supercomputer at CCS, University of Tsukuba
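The reported numbers suggest a simple linear latency estimate. The sketch below assumes that the 500 ns minimum is the 1-hop latency and that every further hop adds about 250 ns; both are readings of the slide rather than statements from the paper. The 100 Gbps nominal link rate is taken from the board description, and the function name is illustrative.

```python
MIN_LATENCY_NS = 500      # minimum latency, read here as the 1-hop latency (assumption)
PER_HOP_NS = 250          # approximate additional latency per hop
MAX_THROUGHPUT_GBPS = 90.2
LINK_RATE_GBPS = 100.0    # nominal rate of one optical link


def estimated_latency_ns(hops):
    """Linear latency estimate for a path of `hops` hops (hops >= 1)."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return MIN_LATENCY_NS + PER_HOP_NS * (hops - 1)


for hops in range(1, 8):
    print(hops, estimated_latency_ns(hops))

print(f"link efficiency: {MAX_THROUGHPUT_GBPS / LINK_RATE_GBPS:.1%}")
```

Under these assumptions even the longest measured path (7 hops) stays around 2 microseconds.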
N. Fujita, et al., Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA , AsHES2020.
What does CIRCUS provide?
• CIRCUS: Communication Integrated Reconfigurable CompUting System
− Goal 1: providing a High Level Synthesis programming environment for parallel FPGA systems with FPGA-FPGA communication links
− Goal 2: combining the computation pipeline and the communication pipeline seamlessly to fully utilize the strengths of FPGA computation/communication
CHARM by oneAPI
• In oneAPI, programming in DPC++ is recommended
− (a) approach
− Problem: existing GPU and FPGA code is written in other languages such as CUDA, OpenCL, etc.; these code assets already exist
− Reimplementation in DPC++ is a burden for users
• oneAPI can also use modules written in other languages
− (b) approach
− Code can be reused
Application Example – ARGOT code
• ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree)
− Simulator for the early-stage universe where the first stars and galaxies were born
− Radiative transfer code developed in Center for Computational Sciences (CCS), University of Tsukuba
− CPU (OpenMP) and GPU (CUDA) implementations are available
− Inter-node parallelism is also supported using MPI
• ART (Authentic Radiation Transfer) method
− It solves radiative transfer from light sources spreading out in space
− Dominant computation part (90%~) of the ARGOT program
• We accelerate the ART method on an FPGA using Intel FPGA SDK for OpenCL as an HLS environment (with oneAPI)
Cosmic Radiative Transfer Simulation
ARGOT
[Figure: point source and diffuse photon]
Two computation elements in ARGOT code: ARGOT method and ART method
• ARGOT method: Point Source processing
• ART method (Authentic Radiation Transfer): Diffused Photon processing
ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree) code under CHARM
[Figure: point-source processing mapped to GPUs and diffuse-photon processing mapped to FPGAs]
• ARGOT scheme for radiative transfer (RT) from point sources: GPU acceleration
• ART scheme for RT from spatially spread-out matter (diffuse photons): FPGA acceleration
Performance evaluation (bare CUDA+OpenCL vs with oneAPI)
• problem size of 32^3
• Single node (1 GPU + 1 FPGA)
• ART
− ART on GPU is slow
− FPGA can accelerate it in a pipelined manner
• oneAPI vs CUDA+OpenCL
− The execution time with oneAPI increases by 1.5%
  => almost no overhead
[Figure: execution time of bare CUDA+OpenCL vs oneAPI; lower is better]
R. Kashino, et al., Multi-hetero Acceleration by GPU and FPGA for Astrophysics Simulation on oneAPI Environment , HPC Asia 2022, Jan. 2022.
ARGOT with 2 nodes (2 GPUs/IB + 2 FPGAs/CIRCUS)
• Weak scaling, 32x32x32 mesh for each node
[Figure: stacked bar chart of execution time [s] versus # of nodes / total mesh size, for GPU-only and GPU + FPGA at 1 node / (32, 32, 32) and 2 nodes / (64, 32, 32); lower is better. Legend: ARGOT init, ARGOT comp., Ray segment assignment, Optical depth accumulation, ART init, ART comp., ART comm., Others. Bar totals: 0.89 s (1 node, GPU-only), 0.13 s (1 node, GPU + FPGA), 2.05 s (2 nodes, GPU-only), 0.16 s (2 nodes, GPU + FPGA).]
• 1 node performance: GPU+FPGA is 6.8x higher than GPU-only
• 2 nodes performance: GPU+FPGA is 12.8x higher than GPU-only
ART method part:
• GPU-GPU MPI comm. is very heavy
  → large overhead from multiple copies of small data chunks
• FPGA-FPGA MPI comm. by CIRCUS
  → very effective
  − low latency & high bandwidth
  − comp. + comm. pipelining
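The quoted speedups follow directly from the four execution times on the chart. The short calculation below reproduces them and also derives a weak-scaling efficiency (1-node time divided by 2-node time); that second metric is an addition for illustration and does not appear on the slide.

```python
# Execution times in seconds, read from the bar chart on the slide.
times = {
    (1, "GPU-only"): 0.89,
    (1, "GPU+FPGA"): 0.13,
    (2, "GPU-only"): 2.05,
    (2, "GPU+FPGA"): 0.16,
}


def speedup(nodes):
    """How many times faster GPU+FPGA is than GPU-only on `nodes` nodes."""
    return times[(nodes, "GPU-only")] / times[(nodes, "GPU+FPGA")]


def weak_scaling_efficiency(config):
    """1-node time divided by 2-node time; 1.0 would be ideal weak scaling."""
    return times[(1, config)] / times[(2, config)]


for nodes in (1, 2):
    print(nodes, round(speedup(nodes), 1))
for config in ("GPU-only", "GPU+FPGA"):
    print(config, round(weak_scaling_efficiency(config), 2))
```

The GPU-only time more than doubles when going to two nodes, while the GPU+FPGA time grows only slightly, which is consistent with the slide's remark that GPU-GPU MPI communication dominates the ART part.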
Summary
• Toward the exa-scale era, homogeneous or single-accelerator systems will be limited in application variety and scalability
• CCS, U. Tsukuba, is running a multi-hetero supercomputer named Cygnus under the CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multi-devices) concept with GPU+FPGA
• Several supporting systems for FPGA and GPU co-working have been developed, including a language solution, toward high sustained performance of multi-physics simulations
• FPGA for HPC is a new concept toward the next generation's flexible and low-power solution beyond GPU-only computing
• Multi-physics simulation is the first-stage target of Cygnus, and it will be expanded to a variety of applications where a GPU-only solution has some bottleneck
• The current FPGA-side implementation is based only on OpenCL, and we need to expand to other languages and other run-time systems
• Call me if you want to use Cygnus with us!

More Related Content

Similar to Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling

Jg3515961599
Jg3515961599Jg3515961599
Jg3515961599
IJERA Editor
 
OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020
OpenACC
 
Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002
MOHAMMED FURQHAN
 
Reconfigurable ICs
Reconfigurable ICsReconfigurable ICs
Reconfigurable ICs
Anish Goel
 
11 Synchoricity as the basis for going Beyond Moore
11 Synchoricity as the basis for going Beyond Moore11 Synchoricity as the basis for going Beyond Moore
11 Synchoricity as the basis for going Beyond Moore
RCCSRENKEI
 
FPGA
FPGAFPGA
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptxOpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
Kohei KaiGai
 
OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021
OpenACC
 
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
Hideyuki Tanaka
 
FAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptFAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.ppt
grssieee
 
How to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway SystemHow to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway System
Larry Smarr
 
Design And Simulation of Modulation Schemes used for FPGA Based Software Defi...
Design And Simulation of Modulation Schemes used for FPGA Based Software Defi...Design And Simulation of Modulation Schemes used for FPGA Based Software Defi...
Design And Simulation of Modulation Schemes used for FPGA Based Software Defi...
Sucharita Saha
 
OpenACC and Open Hackathons Monthly Highlights: April 2022
OpenACC and Open Hackathons Monthly Highlights: April 2022OpenACC and Open Hackathons Monthly Highlights: April 2022
OpenACC and Open Hackathons Monthly Highlights: April 2022
OpenACC
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PC Cluster Consortium
 
OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020
OpenACC
 
Qo s based mac protocol for medical wireless body area sensor networks
Qo s based mac protocol for medical wireless body area sensor networksQo s based mac protocol for medical wireless body area sensor networks
Qo s based mac protocol for medical wireless body area sensor networks
Iffat Anjum
 
Design and implementation of GPU-based SAR image processor
Design and implementation of GPU-based SAR image processorDesign and implementation of GPU-based SAR image processor
Design and implementation of GPU-based SAR image processor
Najeeb Ahmad
 
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...
Larry Smarr
 
0507036
05070360507036
0507036
meraz rizel
 

Similar to Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling (20)

Jg3515961599
Jg3515961599Jg3515961599
Jg3515961599
 
OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020
 
Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002Final presentation [dissertation project], 20192 esv0002
Final presentation [dissertation project], 20192 esv0002
 
Reconfigurable ICs
Reconfigurable ICsReconfigurable ICs
Reconfigurable ICs
 
11 Synchoricity as the basis for going Beyond Moore
11 Synchoricity as the basis for going Beyond Moore11 Synchoricity as the basis for going Beyond Moore
11 Synchoricity as the basis for going Beyond Moore
 
FPGA
FPGAFPGA
FPGA
 
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptxOpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021
 
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
 
FAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptFAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.ppt
 
How to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway SystemHow to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway System
 
Design And Simulation of Modulation Schemes used for FPGA Based Software Defi...
Design And Simulation of Modulation Schemes used for FPGA Based Software Defi...Design And Simulation of Modulation Schemes used for FPGA Based Software Defi...
Design And Simulation of Modulation Schemes used for FPGA Based Software Defi...
 
OpenACC and Open Hackathons Monthly Highlights: April 2022
OpenACC and Open Hackathons Monthly Highlights: April 2022OpenACC and Open Hackathons Monthly Highlights: April 2022
OpenACC and Open Hackathons Monthly Highlights: April 2022
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
 
OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020
 
Qo s based mac protocol for medical wireless body area sensor networks
Qo s based mac protocol for medical wireless body area sensor networksQo s based mac protocol for medical wireless body area sensor networks
Qo s based mac protocol for medical wireless body area sensor networks
 
Design and implementation of GPU-based SAR image processor
Design and implementation of GPU-based SAR image processorDesign and implementation of GPU-based SAR image processor
Design and implementation of GPU-based SAR image processor
 
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging ...
 
0507036
05070360507036
0507036
 

More from Carlos Reaño González

DUAC 2022 Workshop - Welcome Slides
DUAC 2022 Workshop - Welcome SlidesDUAC 2022 Workshop - Welcome Slides
DUAC 2022 Workshop - Welcome Slides
Carlos Reaño González
 
vAccel: Interoperable Application Hardware Acceleration
vAccel: Interoperable Application Hardware AccelerationvAccel: Interoperable Application Hardware Acceleration
vAccel: Interoperable Application Hardware Acceleration
Carlos Reaño González
 
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPUA Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
Carlos Reaño González
 
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs ...
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs ...Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs ...
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs ...
Carlos Reaño González
 
Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early...
Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early...Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early...
Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early...
Carlos Reaño González
 
A framework for low communication approaches for large scale 3D convolution
A framework for low communication approaches for large scale 3D convolutionA framework for low communication approaches for large scale 3D convolution
A framework for low communication approaches for large scale 3D convolution
Carlos Reaño González
 

More from Carlos Reaño González (6)

DUAC 2022 Workshop - Welcome Slides
DUAC 2022 Workshop - Welcome SlidesDUAC 2022 Workshop - Welcome Slides
DUAC 2022 Workshop - Welcome Slides
 
vAccel: Interoperable Application Hardware Acceleration
vAccel: Interoperable Application Hardware AccelerationvAccel: Interoperable Application Hardware Acceleration
vAccel: Interoperable Application Hardware Acceleration
 
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPUA Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
 
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs ...
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs ...Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs ...
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs ...
 
Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early...
Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early...Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early...
Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early...
 
A framework for low communication approaches for large scale 3D convolution
A framework for low communication approaches for large scale 3D convolutionA framework for low communication approaches for large scale 3D convolution
A framework for low communication approaches for large scale 3D convolution
 

Recently uploaded

23PH301 - Optics - Unit 1 - Optical Lenses
23PH301 - Optics  -  Unit 1 - Optical Lenses23PH301 - Optics  -  Unit 1 - Optical Lenses
23PH301 - Optics - Unit 1 - Optical Lenses
RDhivya6
 
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
ABHISHEK SONI NIMT INSTITUTE OF MEDICAL AND PARAMEDCIAL SCIENCES , GOVT PG COLLEGE NOIDA
 
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Sérgio Sacani
 
Polycythemia vera_causes_disorders_treatment.pptx
Polycythemia vera_causes_disorders_treatment.pptxPolycythemia vera_causes_disorders_treatment.pptx
Polycythemia vera_causes_disorders_treatment.pptx
muralinath2
 
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
Sérgio Sacani
 
Microbiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdfMicrobiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdf
sammy700571
 
Quality assurance B.pharm 6th semester BP606T UNIT 5
Quality assurance B.pharm 6th semester BP606T UNIT 5Quality assurance B.pharm 6th semester BP606T UNIT 5
Quality assurance B.pharm 6th semester BP606T UNIT 5
vimalveerammal
 
Lattice Defects in ionic solid compound.pptx
Lattice Defects in ionic solid compound.pptxLattice Defects in ionic solid compound.pptx
Lattice Defects in ionic solid compound.pptx
DrRajeshDas
 
Synopsis presentation VDR gene polymorphism and anemia (2).pptx
Synopsis presentation VDR gene polymorphism and anemia (2).pptxSynopsis presentation VDR gene polymorphism and anemia (2).pptx
Synopsis presentation VDR gene polymorphism and anemia (2).pptx
FarhanaHussain18
 
23PH301 - Optics - Unit 2 - Interference
23PH301 - Optics - Unit 2 - Interference23PH301 - Optics - Unit 2 - Interference
23PH301 - Optics - Unit 2 - Interference
RDhivya6
 
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Creative-Biolabs
 
2001_Book_HumanChromosomes - Genéticapdf
2001_Book_HumanChromosomes - Genéticapdf2001_Book_HumanChromosomes - Genéticapdf
2001_Book_HumanChromosomes - Genéticapdf
lucianamillenium
 
Introduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptxIntroduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptx
QusayMaghayerh
 
Flow chart.pdf LIFE SCIENCES CSIR UGC NET CONTENT
Flow chart.pdf  LIFE SCIENCES CSIR UGC NET CONTENTFlow chart.pdf  LIFE SCIENCES CSIR UGC NET CONTENT
Flow chart.pdf LIFE SCIENCES CSIR UGC NET CONTENT
savindersingh16
 
Physiology of Nervous System presentation.pptx
Physiology of Nervous System presentation.pptxPhysiology of Nervous System presentation.pptx
Physiology of Nervous System presentation.pptx
fatima132662
 
Clinical periodontology and implant dentistry 2003.pdf
Clinical periodontology and implant dentistry 2003.pdfClinical periodontology and implant dentistry 2003.pdf
Clinical periodontology and implant dentistry 2003.pdf
RAYMUNDONAVARROCORON
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
Frédéric Baudron
 
SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆
SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆
SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆
Sérgio Sacani
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
hozt8xgk
 
Reaching the age of Adolescence- Class 8
Reaching the age of Adolescence- Class 8Reaching the age of Adolescence- Class 8
Reaching the age of Adolescence- Class 8
abhinayakamasamudram
 

Recently uploaded (20)

23PH301 - Optics - Unit 1 - Optical Lenses
23PH301 - Optics  -  Unit 1 - Optical Lenses23PH301 - Optics  -  Unit 1 - Optical Lenses
23PH301 - Optics - Unit 1 - Optical Lenses
 
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
 
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
 
Polycythemia vera_causes_disorders_treatment.pptx
Polycythemia vera_causes_disorders_treatment.pptxPolycythemia vera_causes_disorders_treatment.pptx
Polycythemia vera_causes_disorders_treatment.pptx
 
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
 
Microbiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdfMicrobiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdf
 
Quality assurance B.pharm 6th semester BP606T UNIT 5
Quality assurance B.pharm 6th semester BP606T UNIT 5Quality assurance B.pharm 6th semester BP606T UNIT 5
Quality assurance B.pharm 6th semester BP606T UNIT 5
 
Lattice Defects in ionic solid compound.pptx
Lattice Defects in ionic solid compound.pptxLattice Defects in ionic solid compound.pptx
Lattice Defects in ionic solid compound.pptx
 
Synopsis presentation VDR gene polymorphism and anemia (2).pptx
Synopsis presentation VDR gene polymorphism and anemia (2).pptxSynopsis presentation VDR gene polymorphism and anemia (2).pptx
Synopsis presentation VDR gene polymorphism and anemia (2).pptx
 
23PH301 - Optics - Unit 2 - Interference
23PH301 - Optics - Unit 2 - Interference23PH301 - Optics - Unit 2 - Interference

Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling

  • 1. Cygnus – World First Multihybrid Accelerated Cluster with GPU and FPGA Coupling
      Taisuke Boku, Norihisa Fujita, Ryohei Kobayashi, Osamu Tatebe
      Center for Computational Sciences, University of Tsukuba
      {taisuke,fujita}@ccs.tsukuba.ac.jp {kobayashi,tatebe}@cs.tsukuba.ac.jp
      DUAC2022, 2022/08/29
  • 2. Accelerators in HPC ⇒ majority = GPU
      - Is the GPU perfect?
        - good for many applications (replacing vector machines)
        - depends on very wide and regular parallelism
          - large-scale SIMD (STMD) mechanism in a chip
          - high-bandwidth memory (HBM, HBM2) and local memory
        - insufficient for cases with...
          - not enough parallelism
          - irregular computation (warp divergence)
          - frequent inter-node communication (kernel switch, go back to CPU)
      Figure: NVIDIA Tesla A100 Tensor Core GPU (from NVIDIA web page)
  • 3. FPGA in HPC
      - Strengths of recent FPGAs for HPC
        - True co-design with applications (essential)
        - Programmability improvement: OpenCL, other high-level languages
        - High-performance interconnect: 100Gbps x N
        - Precision control is possible
        - Relatively low power
      - Problems
        - Programmability: OpenCL is not enough, not efficient
        - Low standard FLOPS: still cannot catch up to GPU -> “never try what GPU works well on”
        - Memory bandwidth: one generation behind high-end CPU/GPU -> to be improved by HBM (Stratix10)
      Figure: BittWare 520N with Intel Stratix10 FPGA, equipped with 4x 100Gbps optical interconnection interfaces
  • 4. GPU vs FPGA as HPC solutions

      device               | GPU                           | FPGA
      parallelization      | SIMD (x multi-group)          | pipeline (x multi-group)
      standard FLOPS       | 😃😃 (1000x cores)             | 😃 (~100x pipeline)
      conditional branch   | 😢 (warp divergence)          | 😃 (both direction)
      memory               | 😃😃 (HBM2e)                   | 😢 (DDR)→😃 (HBM2)
      interconnect         | 😢 (via host facility)        | 😃😃 (own optical links)
      programming          | 😃 (CUDA, OpenACC, OpenMP)    | 😢 (HDL)→😏 (HLS)
      self-controllability | 😢 (slave device of host CPU) | 😃 (autonomic)
      HPC applications     | 😃 (various fields)           | 😢 (not much)
  • 5. CHARM: Cooperative Heterogeneous Acceleration with Reconfigurable Multi-devices
      Figure (two nodes, each with CPU, GPU, and FPGA; computation on each device, communication over PCIe):
      - invoke GPU/FPGA kernels
      - data transfer via PCIe (invoked from FPGA)
      - Basic cluster with GPUs (by InfiniBand)
      - FPGA network: application-oriented FPGA-FPGA communication over 100Gbps direct optical links
      - Target: multi-physics/multi-scale complicated problems, by cooperative computing with GPU and FPGA
  • 6. Cygnus: world first multi-hybrid cluster with GPU+FPGA
      - Cygnus supercomputer at Center for Computational Sciences, Univ. of Tsukuba (Apr. 2019~), deployed by NEC
      - 85 nodes in total, including 32 “Albireo” nodes with GPU+FPGA (the other “Deneb” nodes have GPU only)
  • 7. Single node configuration (Albireo)
      Figure (single node with FPGA): per CPU, a PCIe network (switch) connects two GPUs, one FPGA, and two HCAs. Labels: inter-FPGA direct network (100Gbps x4), network switch (100Gbps x2).
      - Each node is equipped with both IB EDR and FPGA-direct network
      - Some nodes are equipped with both FPGAs and GPUs; the other nodes have GPUs only
  • 8. Two types of interconnection network
      - InfiniBand HDR100/200 network (100Gbps x4/node), for parallel processing communication and shared file system access from all nodes
        - All computation nodes (Albireo and Deneb) are connected by a full-bisection Fat Tree network with 4 channels of InfiniBand HDR100 (combined into HDR200 at the switch). It carries parallel processing communication such as MPI and is also used to access the Lustre shared file system.
      - Inter-FPGA direct network (only for Albireo nodes)
        - Inter-FPGA torus network: the 64 FPGAs on the Albireo nodes (2 FPGAs/node) are connected by an 8x8 2D torus network without a switch
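The 8x8 2D torus can be made concrete with a small sketch. The Python below is only an illustration: it assumes each FPGA is addressed by a hypothetical (row, col) coordinate and that links wrap around in both dimensions. The actual FPGA addressing and routing used on Cygnus are not described in the slides.

```python
SIZE = 8  # 8x8 torus, 64 FPGAs in total (from the slide)

def torus_neighbors(row, col, size=SIZE):
    """Return the four directly linked neighbors of (row, col), with wraparound."""
    return [
        ((row - 1) % size, col),  # north
        ((row + 1) % size, col),  # south
        (row, (col - 1) % size),  # west
        (row, (col + 1) % size),  # east
    ]

def torus_hops(a, b, size=SIZE):
    """Minimum number of hops between coordinates a and b on the torus."""
    def ring_distance(x, y):
        d = abs(x - y) % size
        return min(d, size - d)  # go whichever way around the ring is shorter
    return ring_distance(a[0], b[0]) + ring_distance(a[1], b[1])

if __name__ == "__main__":
    print(torus_neighbors(0, 0))
    print(torus_hops((0, 0), (4, 4)))
```

Four neighbors per FPGA matches the four 100Gbps optical ports on each board, and the wraparound links keep the worst-case distance at 8 hops in this model.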
  • 9. Figure (Albireo node): 2 CPUs, 4 GPUs, and 2 FPGAs
      - IB HDR100 x4 ⇨ HDR200 x2, to the IB HDR200 switch (for full-bisection Fat-Tree)
      - FPGA optical network: 100Gbps x4 per FPGA, x2 FPGAs
      - 1.2Tbps/node
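The 1.2Tbps/node figure is consistent with adding up both networks. The sketch below assumes that is how the number was obtained (the slide does not spell out the sum); the constant names are illustrative.

```python
# Per-node link counts and speeds as given on the slides.
IB_LINK_GBPS = 100     # one InfiniBand HDR100 channel
IB_LINKS = 4           # HDR100 x4 per node
FPGA_LINK_GBPS = 100   # one optical FPGA link
LINKS_PER_FPGA = 4     # 100Gbps x4 per FPGA
FPGAS_PER_NODE = 2     # Albireo nodes carry two FPGAs

def node_bandwidth_gbps():
    """Aggregate external bandwidth of one Albireo node in Gbps (assumed sum)."""
    infiniband = IB_LINK_GBPS * IB_LINKS
    fpga_network = FPGA_LINK_GBPS * LINKS_PER_FPGA * FPGAS_PER_NODE
    return infiniband + fpga_network

if __name__ == "__main__":
    print(node_bandwidth_gbps() / 1000, "Tbps")
```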
  • 10. Research to support CHARM model on Cygnus
      - FPGA network: CIRCUS (Communication Integrated Reconfigurable CompUting System)
        - direct interconnect facility among FPGA boards by multi-dimensional optical links (~100Gbps) with router and OpenCL-ready API
        - pipelining all computation and communication seamlessly
      - GPU-FPGA DMA: kicked by the FPGA (without CPU)
        - PCIe-protocol-based DMA engine to reduce the overhead of high-speed multi-device data transfer
      - Programming:
        - Intel oneAPI ⇒ task-by-task assignment of computation parts to GPU and FPGA under DPC++ device queue management
      - Application: ARGOT, an astrophysics application for early-universe object generation
        - two main parts are executed by GPU and FPGA
  • 11. CIRCUS
      - Intel FPGA SDK for OpenCL
        - We can describe FPGA hardware in OpenCL
      - Problem: how to write inter-FPGA communication code in OpenCL?
        - MPI is the standard method for HPC applications
          - It is memory-to-memory communication, not suitable for FPGAs
        - We need to utilize pipeline-based communication in an FPGA
      - → CIRCUS: Communication Integrated Reconfigurable CompUting System
        - Pipelined communication and computation
        - communicates from or to a computation pipeline directly

      Sender code on FPGA1:

          __kernel void sender(__global float* restrict x, int n) {
            for (int i = 0; i < n; i++) {
              float v = x[i];
              write_channel_intel(simple_out, v);  // push v into the outgoing channel
            }
          }

      Receiver code on FPGA2:

          __kernel void receiver(__global float* restrict x, int n) {
            for (int i = 0; i < n; i++) {
              float v = read_channel_intel(simple_in);  // blocks until a value arrives
              x[i] = v;
            }
          }

      Figure: sender and receiver are linked through the Comm. Backend.

      * N. Fujita, et al., Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA, AsHES2020.
  • 12. CIRCUS performance
      - Latency (1 hop ~ 7 hops): min. latency 500 ns, additional latency per hop ~250 ns (lower is better)
      - Throughput (1 hop ~ 7 hops): max. throughput 90.2Gbps (higher is better)
      - Evaluated on up to 8 BittWare 520N FPGA boards in the Cygnus supercomputer at CCS, University of Tsukuba
      N. Fujita, et al., Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA, AsHES2020.
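The two latency numbers suggest a simple linear estimate. The sketch below assumes that the 500 ns minimum corresponds to a single hop and that every further hop adds about 250 ns; this linear form is an assumption made for illustration, not a model given in the slides, and the function name is hypothetical.

```python
def estimated_latency_ns(hops, base_ns=500, per_hop_ns=250):
    """Rough latency estimate: base latency for one hop plus a fixed cost per extra hop."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return base_ns + per_hop_ns * (hops - 1)

if __name__ == "__main__":
    # The measurements on the slide cover 1 to 7 hops.
    for hops in range(1, 8):
        print(hops, estimated_latency_ns(hops))
```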
  • 13. What does CIRCUS provide?
      - CIRCUS: Communication Integrated Reconfigurable CompUting System
        - Goal 1: provide a High Level Synthesis programming environment for parallel FPGA systems using FPGA-FPGA communication links
        - Goal 2: combine the computation pipeline and the communication pipeline seamlessly to fully exploit the strengths of FPGA computation/communication
  • 14. CHARM by oneAPI
      - In oneAPI, programming in DPC++ is recommended
        - (a) approach
        - Problem: existing GPU and FPGA code is written in other languages such as CUDA, OpenCL, etc.; these code assets already exist
        - Reimplementation in DPC++ is a burden for users
      - oneAPI can also use modules written in other languages
        - (b) approach
        - Code can be reused
  • 15. Application Example – ARGOT code
      - ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree)
        - Simulator for the early-stage universe, where the first stars and galaxies were born
        - Radiative transfer code developed at the Center for Computational Sciences (CCS), University of Tsukuba
        - CPU (OpenMP) and GPU (CUDA) implementations are available
        - Inter-node parallelism is also supported using MPI
      - ART (Authentic Radiation Transfer) method
        - It solves radiative transfer from light sources spreading out in space
        - Dominant computation part (90%~) of the ARGOT program
      - We accelerate the ART method on an FPGA using Intel FPGA SDK for OpenCL as an HLS environment (with oneAPI)
  • 16. Cosmic Radiative Transfer Simulation ARGOT
      Two computation elements in the ARGOT code: the ARGOT method and the ART method
      - ARGOT method: point source processing
      - ART method (Authentic Radiation Transfer): diffuse photon processing
      Figure labels: Point Source, Diffuse Photon
  • 17. Cosmic Radiative Transfer Simulation ARGOT
      ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree) code, mapped onto GPU and FPGA under CHARM:
      - GPU acceleration: ARGOT scheme for radiative transfer (RT) from point sources
      - FPGA acceleration: ART (Authentic Radiation Transfer) scheme for RT from matter spreading out spatially (diffuse photons)
  • 18. Performance evaluation (bare CUDA+OpenCL vs with oneAPI)
      - Problem size of 32³
      - Single node (1 GPU + 1 FPGA)
      - ART
        - ART on GPU is slow
        - FPGA can accelerate it in a pipelined manner
      - oneAPI vs CUDA+OpenCL
        - The execution time with oneAPI increases by 1.5% => almost no overhead
      Figure: execution time (lower is better)
      R. Kashino, et al., Multi-hetero Acceleration by GPU and FPGA for Astrophysics Simulation on oneAPI Environment, HPC Asia 2022, Jan. 2022.
  • 19. ARGOT with 2 nodes (2 GPUs/IB + 2 FPGAs/CIRCUS)
      - Weak scaling, 32x32x32 mesh for each node
      Figure: stacked bars of execution time [s] (lower is better) against # of nodes / total mesh size. Breakdown: ARGOT init, ARGOT comp., ray segment assignment, optical depth accumulation, ART init, ART comp., ART comm., others.

      # of nodes / total mesh size | GPU-only [s] | GPU + FPGA [s] | GPU+FPGA : GPU-only
      1 node / (32, 32, 32)        | 0.89         | 0.13           | 6.8x higher
      2 nodes / (64, 32, 32)       | 2.05         | 0.16           | 12.8x higher

      - ART method part:
        - GPU-GPU MPI communication is heavy → large overhead from copying many small chunks of data
        - FPGA-FPGA communication by CIRCUS → very effective
          - low latency & high bandwidth
          - comp. + comm. pipelining
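The speedups quoted on this slide follow directly from the four execution times. The short Python check below recomputes them and also derives a weak-scaling efficiency (one-node time divided by two-node time); that efficiency is an added illustration and is not a figure reported in the slides.

```python
# Execution times in seconds, as read from the slide.
TIMES = {
    1: {"gpu_only": 0.89, "gpu_fpga": 0.13},  # 1 node, mesh (32, 32, 32)
    2: {"gpu_only": 2.05, "gpu_fpga": 0.16},  # 2 nodes, mesh (64, 32, 32)
}

def speedup(nodes):
    """Speedup of GPU+FPGA over GPU-only at the given node count."""
    t = TIMES[nodes]
    return t["gpu_only"] / t["gpu_fpga"]

def weak_scaling_efficiency(config):
    """One-node time over two-node time; 1.0 would be ideal weak scaling."""
    return TIMES[1][config] / TIMES[2][config]

if __name__ == "__main__":
    for nodes in (1, 2):
        print(nodes, "node(s):", round(speedup(nodes), 1), "x")
    for config in ("gpu_only", "gpu_fpga"):
        print(config, "efficiency:", round(weak_scaling_efficiency(config), 2))
```

In this reading, the GPU+FPGA run stays close to flat from one node to two, while the GPU-only time more than doubles, which is the effect the slide attributes to the ART communication cost.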
  • 20. Summary
      - Toward the exascale era, homogeneous or single-accelerator systems will be limited in application variety and scalability
      - CCS, U. Tsukuba, is running a multi-hetero supercomputer named Cygnus, built with GPU+FPGA under the CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multi-devices) concept
      - Several supporting systems for FPGA and GPU cooperation have been developed, including a language solution, toward high sustained performance in multi-physics simulations
      - FPGA for HPC is a new concept toward the next generation's flexible and low-power solution beyond GPU-only computing
      - Multi-physics simulation is the first-stage target of Cygnus, and it will be expanded to a variety of applications where a GPU-only solution has some bottleneck
      - The current FPGA-side implementation is based only on OpenCL, and we need to expand to other languages and other run-time systems
      - Call me if you want to use Cygnus with us!