The A64FX is a 64-bit ARM architecture microprocessor designed by Fujitsu. The processor is replacing the SPARC64 V as Fujitsu's processor for supercomputer applications. It powers the Fugaku supercomputer, the fastest supercomputer in the world by TOP500 rankings as of June 2020.
Cores: 48 per CPU plus optional assistant cores;
Instruction set: ARMv8.2-A with SVE and SBSA ...
Common manufacturer(s): TSMC

The Tofu Interconnect D

  1. The Tofu Interconnect D
     11 September 2018
     Yuichiro Ajima, Takahiro Kawashima, Takayuki Okamoto, Naoyuki Shida, Kouichi Hirai, Toshiyuki Shimizu, Shinya Hiramoto, Yoshiro Ikeda, Takahide Yoshikawa, Kenji Uchida, Tomohiro Inoue
     Fujitsu Limited
     Copyright 2018 FUJITSU LIMITED. September 11th, 2018, IEEE Cluster 2018
  2. Introduction
     - The Tofu interconnect (Tofu1) for the K computer
       - Highly scalable and fault-tolerant 6D mesh/torus network
     - The Tofu interconnect 2 (Tofu2) for FX100 machines
     - The Tofu Interconnect D (TofuD) for the post-K machine
       - High “density” of node: integrate more resources into a smaller node
       - Fault resilience of the network: “dynamic” packet slicing for packet transfer
     [Figure: timeline with year marks 2003, 2009, 2012, 2015, and 2021, showing the systems HPC2500, FX1, K computer, FX10, FX100, and Post-K, and the interconnects DTU, InfiniBand, Tofu1, Tofu2, and TofuD]
  3. Features of the Tofu interconnect family
     - 6D Mesh/Torus Network
     - Virtual 3D-Torus Rank-mapping
     - Implementations
     - Communication Functions
     - Tofu Barrier
     - Networks of Recent World-class Systems
  4. 6D Mesh/Torus Network
     - Six coordinate axes: X, Y, Z, A, B, C
       - X, Y, Z: the size varies according to the system configuration
       - A, B, C: the size is fixed to 2×3×2
     - Tofu stands for “torus fusion”: (X, Y, Z)×(A, B, C)
     [Figure: 6D mesh/torus with axes X, Y, Z, A, B, and C; overall size X×Y×Z×2×3×2]
  5. Virtual 3D-Torus Rank-mapping
     - A rank-mapping option for topology-awareness
     - A 3D torus rank can be mapped to a 6D submesh even if there is an unavailable node
     - This fault tolerance contributes to the system availability
     [Figure: ranks of a virtual 3D torus laid out on a 6D submesh, with axes labeled A, B, C and X, Y, Z]
  6. Implementations
     - Tofu1: implemented as an interconnect controller (ICC) chip
       - 4 Tofu network interfaces (TNIs) and 80 lanes of signals for the network
     - Tofu2: integrated into a processor chip
       - The number of signal lanes for the network decreased to 40
       - Considering the balance with 128 signal lanes for memory
     [Figure: block diagrams. Tofu1: SPARC64 VIIIfx with DIMMs, connected to an ICC containing PCIe, TNI0 to TNI3, and the Tofu network router (8 lanes × 10 ports). Tofu2: SPARC64 XIfx with two CMGs and HMCs, integrating PCIe, TNI0 to TNI3, and the Tofu network router (4 lanes × 10 ports)]
  7. Communication Functions of TNI
     - Remote direct memory access (RDMA)
       - Directly accesses process memory on a remote node
       - RDMA Put transfers data to remote process memory
       - RDMA Get transfers data from remote process memory
       - RDMA Atomic modifies a shared variable in remote process memory
     - Low-latency features
       - Direct Descriptor: feeds communication commands from CPU registers
       - Cache Injection (since Tofu2): places received data into a CPU cache
     - Tofu Barrier
       - Offload engine for collective communications such as synchronization
  8. Tofu Barrier (1)
     - Barrier gate (BG) is a hard-wired communication engine
       - Waits for two signals from other BGs and transmits two signals
     - Barrier channel (BCH) is an interface of Tofu barrier
       - Each BCH is fixedly bound to a start-and-end point BG
     - Tofu barrier can execute an arbitrary communication algorithm
       - The recursive-doubling algorithm uses log2(n) BGs in each process
     [Figure: recursive doubling among Process 0 to Process 9, showing barrier channels and barrier gates]
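The log2(n) figure for recursive doubling can be illustrated with a small software model. This is only a sketch of the generic algorithm, not of the BG hardware: the function names are hypothetical, and it assumes the number of processes n is a power of two (the slide's figure uses 10 processes, which this sketch does not handle).

```python
def recursive_doubling_partners(rank: int, n: int) -> list[int]:
    """Partner of `rank` at each stage of a recursive-doubling barrier.

    Assumption of this sketch: n is a power of two, so there are exactly
    log2(n) stages and the stage-k partner is rank XOR 2**k.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    stages = n.bit_length() - 1  # log2(n) for a power of two
    return [rank ^ (1 << k) for k in range(stages)]


def simulate_barrier(n: int) -> bool:
    """Return True if every rank has heard from all ranks after log2(n) stages."""
    # known[r] is the set of ranks whose arrival rank r has learned about.
    known = [{r} for r in range(n)]
    for k in range(n.bit_length() - 1):
        # Each rank exchanges its knowledge with its stage-k partner.
        known = [known[r] | known[r ^ (1 << k)] for r in range(n)]
    everyone = set(range(n))
    return all(s == everyone for s in known)


if __name__ == "__main__":
    print(recursive_doubling_partners(5, 8))
    print(simulate_barrier(8))
```

One stage per partner is consistent with the slide's statement that each process needs log2(n) BGs, since each stage is one wait-and-transmit step.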
  9. Tofu Barrier (2)
     - The reduce-broadcast algorithm uses a maximum of 5 BGs in each process
     - In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier
       - 8 BCHs and 64 BGs per node
       - Up to 8 communicators per node can use Tofu barrier simultaneously
       - Performing intra-node synchronization in software is recommended, to reduce consumption of BCHs and BGs
     [Figure: reduce-broadcast tree among Process 0 to Process 9, showing barrier channels and barrier gates]
  10. Networks of Recent World-class Systems
     - All systems have the same order of bisection bandwidth
       - No significant performance difference in global data exchange
     - Torus networks have higher total injection bandwidth
       - Topology-aware communication such as nearest-neighbor data exchange results in higher performance

     | System | Network | Total injection bandwidth (PB/s) | Bisection bandwidth (TB/s) | Injection/bisection ratio |
     |---|---|---|---|---|
     | Blue Gene/Q | Torus (5D) | 1.97 | 49 | 40X |
     | K Computer | Mesh/Torus (6D), Virtual Torus (3D) | 1.66 | 46, 34 | 36X |
     | Sunway TaihuLight | Tapered Fat-Tree | 0.51 | 70 | 7.3X |
     | Piz Daint | Dragonfly | 0.07 | 36 | 2.0X |
     | Summit | Fat-Tree | 0.12 | 115 | 1.0X |
     | Oakforest-PACS | Fat-Tree | 0.10 | 102 | 1.0X |
  11. The Design of TofuD
     - Higher-density Node Configuration
     - Link Configuration and Injection Bandwidth
     - Packaging
     - Dynamic Packet Slicing
     - Increased Tofu Barrier Resources
  12. Higher-density Node Configuration
     - The CPU is smaller and the off-chip channels are halved
       - The number of 3D-stacked memories was halved from 8 to 4
       - Each Tofu link was reduced from 4 lanes to 2 lanes
     - More resources are integrated into the CPU
       - The number of CPU Memory Groups (NUMA nodes) doubled from 2 to 4
       - The number of Tofu Network Interfaces increased from 4 to 6
     [Figure: block diagrams. Tofu2: SPARC64 XIfx with 2 CMGs, 8 HMCs, PCIe, TNI0 to TNI3, and the Tofu network router (4 lanes × 10 ports). TofuD: A64FX with 4 CMGs, 4 HBM2 stacks, a NOC, PCIe, TNI0 to TNI5, and the Tofu network router (2 lanes × 10 ports)]
  13. Link Configuration and Injection Bandwidth
     - Data transfer rate increased from 25 Gbps to 28 Gbps
       - Link bandwidth reduced from 12.5 GB/s to 6.8 GB/s
     - TofuD simultaneously transmits in 6 directions
       - Increased from 4 directions in the case of Tofu1 and Tofu2
     - Total injection bandwidth per node is 40.8 GB/s
       - Approximately twice that of Tofu1, or 80% that of Tofu2

     | | Tofu1 | Tofu2 | TofuD |
     |---|---|---|---|
     | Data rate (Gbps) | 6.25 | 25.78125 | 28.05 |
     | Number of signal lanes per link | 8 | 4 | 2 |
     | Link bandwidth (GB/s) | 5.0 | 12.5 | 6.8 |
     | Number of TNIs per node | 4 | 4 | 6 |
     | Injection bandwidth per node (GB/s) | 20 | 50 | 40.8 |
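The table values can be cross-checked with a few lines of arithmetic. The sketch below takes the data rates, lane counts, and TNI counts from the slide; the line-coding efficiencies (8b/10b for Tofu1, 64b/66b for Tofu2 and TofuD) are an assumption that is not stated on the slide, chosen because they reproduce the listed link bandwidths. The function and variable names are illustrative.

```python
# name: (data rate per lane in Gbps, lanes per link, TNIs per node,
#        ASSUMED line-coding efficiency, not stated on the slide)
GENERATIONS = {
    "Tofu1": (6.25, 8, 4, 8 / 10),
    "Tofu2": (25.78125, 4, 4, 64 / 66),
    "TofuD": (28.05, 2, 6, 64 / 66),
}


def link_bandwidth_gbytes(rate_gbps: float, lanes: int, efficiency: float) -> float:
    """Payload bandwidth of one link in GB/s (8 bits per byte)."""
    return rate_gbps * lanes * efficiency / 8


def injection_bandwidth_gbytes(rate_gbps: float, lanes: int, tnis: int,
                               efficiency: float) -> float:
    """Per-node injection bandwidth: one link's bandwidth per TNI."""
    return link_bandwidth_gbytes(rate_gbps, lanes, efficiency) * tnis


if __name__ == "__main__":
    for name, (rate, lanes, tnis, eff) in GENERATIONS.items():
        link = link_bandwidth_gbytes(rate, lanes, eff)
        inj = injection_bandwidth_gbytes(rate, lanes, tnis, eff)
        print(f"{name}: link {link:.1f} GB/s, injection {inj:.1f} GB/s")
```

Under these assumptions, the TofuD injection bandwidth of 40.8 GB/s is about 2.04 times that of Tofu1 and about 82% that of Tofu2, in line with the slide's "approximately twice" and "80%".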
  14. Packaging – CPU Memory Unit of Post-K
     - Two CPUs connected with C-axis
       - X×Y×Z×A×B×C = 1×1×1×1×1×2
     - Two or three active optical cable (AOC) cages on the board
       - Each cable bundles two lanes of signals from each of the two CPUs
     [Figure: CPU Memory Unit board with two CPUs and AOC cages, including cages labeled X, Y, and Z]
  15. Packaging – Rack Structure of Post-K
     - Rack
       - 8 shelves
       - 192 CMUs or 384 CPUs
     - Shelf
       - 24 CMUs or 48 CPUs
       - X×Y×Z×A×B×C = 1×1×4×2×3×2
     - Top or bottom half of rack
       - 4 shelves
       - X×Y×Z×A×B×C = 2×2×4×2×3×2
     [Figure: rack and shelves]
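The packaging numbers follow from multiplying the six axis sizes. A minimal check, using only the shapes quoted on the packaging slides (the constant and function names are illustrative):

```python
from math import prod

# Shapes in X, Y, Z, A, B, C order, as quoted on the packaging slides.
CMU = (1, 1, 1, 1, 1, 2)        # CPU Memory Unit: two CPUs along the C axis
SHELF = (1, 1, 4, 2, 3, 2)
HALF_RACK = (2, 2, 4, 2, 3, 2)  # top or bottom half of a rack (4 shelves)


def node_count(shape: tuple[int, ...]) -> int:
    """Number of nodes (CPUs) in a 6D block of the given shape."""
    return prod(shape)


if __name__ == "__main__":
    print("CPUs per CMU:      ", node_count(CMU))
    print("CPUs per shelf:    ", node_count(SHELF))
    print("CPUs per half rack:", node_count(HALF_RACK))
    print("CPUs per rack:     ", 2 * node_count(HALF_RACK))
```

The results agree with the slide: 48 CPUs (24 CMUs) per shelf, 192 CPUs per half rack of 4 shelves, and 384 CPUs (192 CMUs) per rack.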
  16. Dynamic Packet Slicing – Split Mode
     - An upper layer in TofuD slices packets for each signal lane
       - Each signal lane of TofuD has an independent physical layer
       - In ordinary multi-lane transmission, the physical layer has a media-independent interface and hides the number of signal lanes
     - For virtual cut-through packet transfer, the routing header is copied to both slices of the packet
     - This normal operation mode is called split mode
     [Figure: a packet with its routing header divided into Slice 0 and Slice 1, each slice carrying the routing header]
  17. Dynamic Packet Slicing – Duplicate Mode
     - The upper layer duplicates packets when the error rate is high
       - This fall-down mode is called duplicate mode
       - The link can recover to the split mode
     - Each lane is never disconnected independently
       - The error rates of both lanes are continuously monitored and fed back
     [Figure: the same packet sent on both lanes in duplicate mode, compared with Slice 0 and Slice 1 in split mode, with error rate feedback]
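The two modes can be modeled in software to make the difference concrete. This is a simplified sketch, not the TofuD hardware behavior: the slide does not say how the payload is divided between the lanes (contiguous halves are assumed here), and the threshold rule in `choose_mode` is a hypothetical stand-in for the error-rate feedback. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneFrame:
    lane: int       # signal lane index (0 or 1)
    header: bytes   # routing header, present on every lane
    body: bytes     # payload bytes carried on this lane


def slice_packet(header: bytes, payload: bytes, mode: str) -> list[LaneFrame]:
    """Build the per-lane frames for a two-lane link."""
    if mode == "split":
        # Assumption: the payload is cut into two contiguous halves.
        half = (len(payload) + 1) // 2
        return [LaneFrame(0, header, payload[:half]),
                LaneFrame(1, header, payload[half:])]
    if mode == "duplicate":
        # The whole packet is sent on both lanes.
        return [LaneFrame(0, header, payload),
                LaneFrame(1, header, payload)]
    raise ValueError(f"unknown mode: {mode!r}")


def choose_mode(lane_error_rates: tuple[float, float], threshold: float) -> str:
    """Hypothetical policy: fall down to duplicate mode while either lane's
    monitored error rate exceeds the threshold, otherwise use split mode."""
    return "duplicate" if max(lane_error_rates) > threshold else "split"


if __name__ == "__main__":
    mode = choose_mode((1e-12, 1e-6), threshold=1e-9)
    for frame in slice_packet(b"HDR", b"abcdef", mode):
        print(mode, frame)
```

Because `choose_mode` is re-evaluated from the monitored error rates, the modeled link returns to split mode once both lanes are healthy again, mirroring the recovery described on the slide.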
  18. Increased Tofu Barrier Resources
     - The number of resources significantly increased
       - All 6 TNIs of TofuD have a Tofu barrier
       - This change is intended to support synchronization between CMGs
     - The ratio of BCHs to BGs was increased from 1:8 to 1:3
       - Assuming an increase in the usage of the reduce-broadcast tree algorithm

     | | | Tofu1 | Tofu2 | TofuD |
     |---|---|---|---|---|
     | Node | Number of BCHs | 8 | 8 | 96 |
     | Node | Number of BGs | 64 | 64 | 288 |
     | Node | Number of TNIs | 4 | 4 | 6 |
     | Node | Number of CMGs | 1 | 2 | 4 |
     | TNI | Number of BCHs | 8 | 8 | 16 |
     | TNI | Number of BGs | 64 | 64 | 48 |
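The per-node totals and the BCH-to-BG ratios in the table can be derived from the per-TNI figures. A small sketch, using the per-TNI counts from the table and the statements that only TNI #0 had a Tofu barrier in Tofu1 and Tofu2 while all 6 TNIs have one in TofuD (the names are illustrative):

```python
# name: (BCHs per barrier-capable TNI, BGs per barrier-capable TNI,
#        number of TNIs that have a Tofu barrier)
BARRIER_RESOURCES = {
    "Tofu1": (8, 64, 1),   # only TNI #0 has a Tofu barrier
    "Tofu2": (8, 64, 1),   # only TNI #0 has a Tofu barrier
    "TofuD": (16, 48, 6),  # all 6 TNIs have a Tofu barrier
}


def per_node(name: str) -> tuple[int, int]:
    """Return (BCHs per node, BGs per node)."""
    bch, bg, tnis = BARRIER_RESOURCES[name]
    return bch * tnis, bg * tnis


def bgs_per_bch(name: str) -> int:
    """Number of BGs available per BCH, i.e. the N in a 1:N ratio."""
    bch, bg, _ = BARRIER_RESOURCES[name]
    return bg // bch


if __name__ == "__main__":
    for name in BARRIER_RESOURCES:
        bchs, bgs = per_node(name)
        print(f"{name}: {bchs} BCHs, {bgs} BGs per node, ratio 1:{bgs_per_bch(name)}")
```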
  19. Evaluations
     - Evaluation Environment
     - Put Latencies
     - Latency Breakdown
     - Put Throughputs
     - Injection Rates
     - Tofu Barrier Performance
  20. Evaluation Environment
     - TofuD
       - Evaluated by hardware emulators using the production RTL codes
         - The simulation models were system-level and included multiple nodes
         - Simulated processors executed test programs
         - The test programs directly accessed the TofuD hardware
       - Results were measured from simulation waveforms
     - Tofu1 and Tofu2
       - Evaluations used real machines
         - Real processors executed test programs
         - The test programs used low-level communication libraries
       - Results were measured using the processor’s cycle counter
       - Only latency breakdowns were obtained from simulation waveforms
  21. Put Latencies
     - 8B Put transfer between nodes on the same board
       - The low-latency features were used
     - Tofu2 reduced the Put latency by 0.20 µs from that of Tofu1
       - The cache injection feature contributed to this reduction
     - TofuD reduced the Put latency by 0.22 µs from that of Tofu2

     | Interconnect | Communication settings | Latency |
     |---|---|---|
     | Tofu1 | Descriptor on main memory | 1.15 µs |
     | Tofu1 | Direct Descriptor | 0.91 µs |
     | Tofu2 | Cache injection OFF | 0.87 µs |
     | Tofu2 | Cache injection ON | 0.71 µs |
     | TofuD | To/From far CMGs | 0.54 µs |
     | TofuD | To/From near CMGs | 0.49 µs |
  22. Latency Breakdown
     - The overhead increase in Tofu2 has been reduced
     [Figure: stacked bars of latency (nsec, axis from 0 to 1000) for Tofu1, Tofu2, and TofuD, broken down into Tx CPU, Tx Host bus, Tx TNI, Packet Transfer, Rx TNI, Rx Host bus, and Rx CPU; annotations: Cache injection, Tx Optimization, Rx Optimization, Increased Overhead in Physical Layer, Overhead Reduced]
  23. Put Throughputs
     - One-way Put transfer between nodes on the same board
       - Measured saturated throughput values at message sizes over 1 MiB
     - All measured efficiencies exceed 90%
       - The Tofu interconnect family has high bandwidth efficiency
       - The maximum packet size is large enough to encapsulate an IP packet

     | | Put throughput | Efficiency |
     |---|---|---|
     | Tofu1 | 4.76 GB/s | 95 % |
     | Tofu2 | 11.46 GB/s | 92 % |
     | TofuD | 6.35 GB/s | 93 % |
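The efficiency column appears to be the measured Put throughput divided by the link bandwidth listed on the link configuration slide (5.0, 12.5, and 6.8 GB/s). That definition is an assumption of this sketch rather than something the slide states, but it reproduces the listed percentages:

```python
# name: (measured Put throughput in GB/s, link bandwidth in GB/s)
PUT_RESULTS = {
    "Tofu1": (4.76, 5.0),
    "Tofu2": (11.46, 12.5),
    "TofuD": (6.35, 6.8),
}


def efficiency_percent(throughput: float, link_bandwidth: float) -> int:
    """Throughput as a whole-number percentage of the link bandwidth
    (assumed definition of 'efficiency')."""
    return round(100 * throughput / link_bandwidth)


if __name__ == "__main__":
    for name, (throughput, link) in PUT_RESULTS.items():
        print(f"{name}: {efficiency_percent(throughput, link)} %")
```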
  24. Injection Rates per Node
     - Simultaneous Put transfers to multiple nearest-neighbor nodes
       - Tofu1 and Tofu2 used 4 TNIs, and TofuD used 6 TNIs
     - The injection rate of TofuD was approximately 83% that of Tofu2
     - The efficiencies of Tofu1 were lower than 90%
       - Because of a bottleneck in the bus that connects CPU and ICC
     - The efficiencies of Tofu2 and TofuD exceeded 90%
       - Integration into the processor chip removed the bottleneck

     | | Injection rate | Efficiency |
     |---|---|---|
     | Tofu1 (K) | 15.0 GB/s | 77 % |
     | Tofu1 (FX10) | 17.6 GB/s | 88 % |
     | Tofu2 | 45.8 GB/s | 92 % |
     | TofuD | 38.1 GB/s | 93 % |
  25. Tofu Barrier – Test Programs
     - The test program synchronized multiple BCHs in a node
       - Executed in one processor core to simplify the waveform analysis
     - The test programs used the following algorithms:
       - The reduce-broadcast tree algorithm for intra-TNI synchronization
       - The recursive doubling algorithm for inter-TNI synchronization
     - Simple estimations were also calculated
       - Accumulated logic circuit delays of BCH (0.48 µs) and BG (0.13 µs)
       - Considered only the number of communication stages

     | Number of BCHs | 1 | 4 | 8 | 16 | 48 |
     |---|---|---|---|---|---|
     | Number of used TNIs | 1 | 4 | 6 | 6 | 6 |
     | Max. number of BCHs per TNI | 1 | 1 | 2 | 3 | 8 |
     | Max. number of BGs per TNI | 2 | 2 | 5 | 9 | 24 |
     | Number of communication stages | 2 | 2 | 4 | 6 | 9 |
  26. Tofu Barrier – Results
     - The simple estimate results were too low
       - They did not consider the serialization of BCH/BG processing
     - The modified estimates were consistent with the evaluation results
       - BCHs need to be allocated in a round-robin manner to avoid sharing a TNI
     [Figure: latency (µsec, axis from 0.0 to 5.0) versus number of BCHs per node (axis ticks 1, 4, 16, 64) for the simple estimate, the estimate considering serialization, and the evaluation result; the simple estimate is marked as an underestimate]
  27. Summary
     - TofuD is developed for the post-K machine
       - TofuD is designed to achieve high-density nodes and enhanced resilience with dynamic packet slicing
     - The design of TofuD
       - Node and link configurations
       - CMU and rack packaging
       - Dynamic Packet Slicing
       - Increased Tofu Barrier Resources
     - The evaluation results of TofuD
       - Latency was 0.49 µs, which was reduced by 0.22 µs from that of Tofu2
       - Throughput was 6.35 GB/s and the efficiency exceeded 90%
       - Injection rate was 38.1 GB/s, which was approximately 83% that of Tofu2
       - The evaluation results showed that it is necessary to allocate BCHs without sharing a TNI
