High-speed network switch RHiNET-2/SW and its implementation with optical interconnections

S. Nishimura (1), T. Kudoh (2), H. Nishi (2), J. Yamamoto (2), K. Harasawa (3), N. Matsudaira (3), S. Akutsu (3), K. Tasyo (4), and H. Amano (5)

(1) RWCP Optical Interconnection Hitachi Laboratory (c/o Central Research Laboratory, Hitachi, Ltd.), 1-280 Higashi-Koigakubo, Kokubunji, Tokyo 185-8601, JAPAN
(2) Real World Computing Partnership Tsukuba Research Center
(3) Hitachi Communication Systems, Inc.
(4) Synergetech, Inc.
(5) Dept. of Information and Computer Science, Keio University

Abstract: RHiNET-2/SW is a network switch that enables high-performance parallel computing in a distributed environment. We have produced a prototype network-switch board (RHiNET-2/SW) for the RHiNET-2 parallel-computing system. Eight pairs of 800-Mbit/s 12-channel optical interconnection modules and a one-chip CMOS ASIC switch LSI (a 784-pin BGA package) are mounted on a single board. This switch allows high-speed 8-Gbit/s/port parallel optical data transmission over a distance of up to 100 m, and the aggregate throughput is 64 Gbit/s/board. By using a large amount of embedded memory on the switching LSI, RHiNET-2/SW allows low-latency, free-topology network performance. We evaluated the reliability of each optical port by measuring the BER: no errors were detected during 10^11-bit packet data transmission at a data rate of 880 Mbit/s × 10 bits (BER: < 10^-11). This test result shows that the RHiNET-2/SW can provide high-throughput, long-transmission-length, and highly reliable data transmission.

I. INTRODUCTION

Network-based parallel processing using commodity components, such as personal computers, has received attention as an important parallel-computing environment [1-3]. In most of today's offices and laboratories, there are tens of personal computers and workstations, which are not always in use. If the processing power of such idle computers could be combined, the resulting processing power might be comparable to that of a supercomputer.

However, most high-performance cluster systems consisting of personal computers or workstations use a system area network (or a server area network: SAN) such as Myrinet [4] as their interconnection. A SAN provides low-latency, high-bandwidth communication without discarding any packet. It also provides high bisection bandwidth, which is required for high-performance parallel computing. Since SANs are designed to connect dedicated computers in a small place, both the link length and the topology are restricted. On the other hand, high-speed LANs with more than 1 Gbit/s link bandwidth are becoming available [5-6]. LANs provide relatively flexible topology choices and longer links. Nevertheless, the communication latency of most of today's commodity LANs tends to be larger than that of SANs because of their store-and-forward routing strategy. Moreover, today's LANs support the IP protocol, which consists of many layers that introduce overhead.

LASN (Local Area System Network) is a new class of networks which has the advantages of both SANs and LANs. As shown in Fig. 1, an LASN assumes an environment in which personal computers and workstations are distributed within one or more floors of a building (i.e. a LAN environment).
RHiNET (RWCP High performance network) is the first network designed with the concepts of an LASN in mind.

Fig. 1: Schematic structure of RHiNET

II. CONCEPT OF THE RHINET

We have developed the first prototype, called RHiNET-1, using 1.33-Gbit/s optical interconnections [7] and the second prototype, called RHiNET-2, using 8.8-Gbit/s optical interconnections. In RHiNET-2, PCs are interconnected via high-throughput 8×8 network switches (the RHiNET-2/SW) [8]. RHiNET-2/SW has eight optical input and eight optical output ports (Fig. 2). Each port has an 8-Gbit/s transmission capacity, and the aggregate throughput is 64 Gbit/s. A 12-bit synchronized parallel optical signal is converted to a 12-bit electrical signal in the optical receiver, switched by the SW-LSI, and re-converted to a 12-bit parallel optical signal in the optical transmitter. The transmission length is limited to 100 m by the skew of the fiber ribbon (however, when copper cables are used for electrical interconnection, the transmission length is limited to about 10 m by electrical circuit drivability).

To meet the requirements of RHiNET, RHiNET-2/SW provides large internal memory blocks and supports topology-free, reliable, low-latency and high-bandwidth communication. To achieve such high-speed optical transmission in RHiNET, 8.8-Gbit/s (800-Mbit/s × 11-bit) synchronized parallel optical interconnection is used for each data link. Synchronized parallel optical interconnection allows high-speed, long-transmission-length, low-latency node-to-node interconnection [9-13].

III. OPTICAL INTERCONNECTION MODULE

We use synchronized parallel 12-channel optical transmitter and receiver modules in RHiNET-2/SW [9, 10] (Fig. 3). The optical transmitter modules consist of a 1.3-µm edge-emitting laser-diode (LD) array and a single-mode-fiber (SMF) array. The channel configuration is made up of 11 low-voltage-differential-signaling (LVDS)/positive-level emitter-coupled-logic (P-ECL) non-return-to-zero (NRZ) data signals and one LVDS/P-ECL clock signal. The P-ECL output signals from the receiver module are converted to LVDS signals with a level converter. The input clock signal is used to latch the 11 data signals in the transmitter and receiver modules in order to eliminate skew caused by the logic LSIs. The transmission length is up to 100 m and the total throughput is up to 8.8 Gbit/s/module (800-Mbit/s × 11-bit data and 1-bit clock). To achieve high-density implementation with high-speed signaling devices, we have to overcome many complex problems, such as crosstalk, skew and propagation loss.

Fig. 2: Schematic structure of RHiNET-2/SW. Eight optical inputs (12-channel optical receivers, RX) and eight optical outputs (12-channel optical transmitters, TX) are connected to the 8×8 SW-LSI (internal memory: approx. 512 kbyte) by differential LVDS electrical signals on the printed circuit board. Each port carries 800 Mbit/s × 10 bits with clock and framing (10+2 channels).
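The port and board figures quoted in Sections II and III follow from the channel count by simple multiplication. The short Python sketch below reproduces that arithmetic. The variable and function names are illustrative only, and the split of the 11 data signals into 10 packet-data bits plus one framing bit is our reading of the port description (10-bit data, clock and framing), not a statement taken from the module data sheet.

```python
# Worked arithmetic for the link rates quoted in the text.
# All names are illustrative; the numbers come from Sections II and III.
CHANNEL_RATE_MBIT = 800  # signalling rate of one channel, Mbit/s
DATA_BITS = 10           # packet-data channels per port
FRAMING_BITS = 1         # framing channel (assumed to be the 11th data signal)
CLOCK_BITS = 1           # clock channel
PORTS = 8                # input (or output) ports per switch


def port_rates_mbit():
    """Return (payload rate, raw link rate) of one port in Mbit/s."""
    payload = CHANNEL_RATE_MBIT * DATA_BITS
    link = CHANNEL_RATE_MBIT * (DATA_BITS + FRAMING_BITS)
    return payload, link


payload, link = port_rates_mbit()
channels = DATA_BITS + FRAMING_BITS + CLOCK_BITS
aggregate = payload * PORTS

print(f"channels per module : {channels}")
print(f"payload per port    : {payload / 1000} Gbit/s")
print(f"raw link per module : {link / 1000} Gbit/s")
print(f"aggregate per board : {aggregate / 1000} Gbit/s")
```

Under these assumptions the 8-Gbit/s port capacity counts only the 10 packet-data channels, while the 8.8-Gbit/s module figure also counts the framing channel.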
Fig. 3: 12-channel parallel optical interconnection modules [9, 10]

Fig. 4: Block diagram of the switch core in the SW-LSI for RHiNET-2/SW

Fig. 5: Floor plan of the SW-LSI

IV. SWITCHING LSI

A. Overview

We developed a 64-Gbit/s/chip high-throughput CMOS switching LSI for the RHiNET-2/SW (Figs. 2, 4 and 5). This switching LSI has eight input and eight output ports. Both input and output ports consist of 10-bit data signals, a clock signal, and a framing signal, each at 800 Mbit/s. The core switch logic operates at a clock rate of 100 MHz. Therefore, 1:8 demultiplexers are provided at the input ports and 1:8 multiplexers are provided at the output ports. The 8 × 10-bit incoming data are transformed to 80-bit data by the demultiplexer. ECC decoders and encoders are provided at the input and output ports, respectively. The ECC decoder decodes the 80-bit data into 66-bit data, which are handled by the core logic. Input signals synchronized with the transmission clock are retimed to be in phase with the baseclock (200 MHz) in the elastic buffer. Since source-synchronous clocking is used, an elastic buffer is provided at each input port to compensate for a difference of up to 100 ppm between the transmission clock and the baseclock.

All electrical I/O interfaces are 2.5-V LVDS-CMOS devices. To achieve high-speed I/O, the rise and fall times (< 0.3 ns) and the signal skew (< 0.3 ns) must be very small. We used 0.18-µm technology to fabricate the LSI. The LSI package is a 784-pin ball grid array (BGA), the pin pitch of the package is 1.27 mm, and the package size is 39 × 39 mm. There are 384 high-speed signal pins (12 bits/port × 8 ports × 2 pins [differential] for input and output; data rate: 800 Mbit/s/pin). We customized the assignment of the LSI pins to achieve high-speed, low-crosstalk data I/O with a compact, high-density circuit board.

B. Switching functions

RHiNET-2/SW has the following features (Figs. 4 and 5):

1) Asynchronous wormhole routing
   Store-and-forward routing, which is commonly used in conventional LAN switches/routers, yields a large latency. Wormhole routing achieves low-latency switching, since a switch can, if possible, transmit the first part of a packet even while receiving the latter part of the same packet [2]. However, the performance of pure wormhole routing is severely degraded when a message is multicast in a loaded network. To cope with this problem, asynchronous wormhole
   routing (which provides a certain size of packet buffer) is adopted.

2) No packet discarding
   The switch never discards packets even when the network is severely congested.

3) In-order delivery
   The network ensures in-order delivery of packets.

4) Free topology design while avoiding deadlock
   The switch avoids deadlock by providing a number of VCs (virtual channels) at each input port. By using a different VC as a packet travels through the switches, no cyclic dependency is generated. RHiNET-2/SW has 16 VCs at each input port. This means the diameter of the network can be up to 16. Since each switch has eight ports, this number of VCs provides a virtually free topology of the network.

5) Support for links of up to 100 m
   RHiNET-2/SW supports optical links up to 100 m long. Since an optical signal propagates in the fiber with a delay of 5 ns/m, the round-trip delay of a 100-m-long optical link is about 1 µs. The handshake logic of the switch also adds some delay. Therefore, a handshake will take up to 1.5 µs. At a data rate of 8 Gbit/s, a 1.5-µs delay corresponds to the time needed to transfer 1.5 kbytes. Therefore, to receive data without discarding anything, the receiver side should send a handshake message to stop transmission when it does not have enough usable memory space (less than 1.5 kbytes + maximum packet size) in the input buffer. Such a flow-control mechanism is called the slack buffer [7]. In RHiNET-2/SW, each of the 16 VCs of an input port provides a 4-kbyte slack-buffer mechanism. Multiple slack buffers are therefore provided for each input port.

6) Multiple-bit-rate support
   Each port can be set to a bit rate of 8 Gbit/s, 2 Gbit/s or 1 Gbit/s. The slower bit rates are provided to support slower network interfaces. The maintenance processor sets the bit rate.

C. Packets

Figure 6 shows the packet format of RHiNET-2/SW handled in the core logic (64-bit data). RHiNET-2/SW supports variably sized packets. A data packet contains a maximum data size of 2 kbytes. The hop counter is incremented when a packet goes through a switch and is used to detect an irregularly routed packet caused by a wrong routing table or a damaged header. A handshake packet includes programmable "almost full" flags of all VCs. A ping/pong packet reports its own logical ID, physical ID, and port ID. Command packets are used to exchange information between the maintenance processors of adjacent nodes. They are immediately forwarded to the maintenance processor when received. The payload of the command packet is the message to the maintenance processor.

Fig. 6: RHiNET-2/SW packet format

D. Routing

Routing is done according to the routing information statically stored in the routing table of each switch. Each switch has a routing table with 65,536 entries. An entry is a 9-bit full bitmap of the outputs (8 bits correspond to the output ports, and one bit corresponds to the maintenance processor), and setting multiple bits of an entry provides multicasting. The routing ID in the header of a data packet is used as the entry ID of the routing table. The maintenance processor sets the entries of the routing table. For example, if destination routing is used and there is no multicasting, an entry of the routing table can correspond to a node; thus, a maximum of 65,536 nodes can be supported.

E. Maintenance processor and hot-plug support

An on-chip maintenance module and an off-chip maintenance processor are provided to configure routing tables and support dynamic link detection. While a link has not been established, RHiNET-2/SW continuously transmits ping packets. When a switch receives a ping packet, it replies with a pong
packet and then the link between the two switches is established. The ping and pong packets include the sender's physical ID (Fig. 6). By receiving a ping or a pong packet, the maintenance processor obtains the physical ID of the switch at the other end of the link. Then, the maintenance processors of the switches exchange the necessary information to set the routing table. RHiNET-2/SW transmits handshake packets at regular intervals while a link is established. It detects link disconnection if it receives no handshake packet for a certain period of time. In such a case, RHiNET-2/SW starts to transmit ping packets again.

V. HIGH-DENSITY IMPLEMENTATION OF HIGH-SPEED SIGNALS

In RHiNET-2/SW, to realize high-speed, high-density integration with the optical interconnection module and SW-LSI, we employ a MULTIWIRE(TM)* interconnect board (MWB(TM)) as a printed circuit board. The MWB can achieve high wiring density and superior electrical characteristics (low loss, high-accuracy 50-ohm impedance, and low reflection). The MWB uses copper wires (0.1 mm diameter) that are coated with polyimide insulation and can therefore be cross-wired. This also accounts for the high wiring density (a 0.3-mm wire pitch can be achieved). Since constant-diameter wires are incorporated in the MWB, a controlled characteristic impedance (Z0: 50 Ω) can be easily realized. Furthermore, by utilizing very thin wires with adequate spacing, crosstalk and bending loss are minimized.

We measured the physical characteristics of the MWB (propagation loss and crosstalk). In 150-mm-long straight wires, the -3-dB bandwidth was greater than 2.4 GHz, and the crosstalk on the receiver side was less than 1.2% at 900 MHz and a wire pitch of 0.5 mm. We then optimized the layout of the circuit board based on the experimental results to realize low-crosstalk, high-speed, and high-density electrical I/O [8].

VI. RHINET-2/SW

We have produced a prototype of the RHiNET-2/SW eight-by-eight network switch (Fig. 7). In the center of the board, the SW-LSI is mounted in an LSI socket. This socket was designed specifically for the high-speed LSI (bandwidth: DC to 6 GHz; path inductance: < 1 nH; and capacitance [signal-to-signal]: < 1 pF). Each port has 800-Mbit/s × 12-bit optical I/O channels and uses one pair of the 12-channel parallel optical interconnection modules. The board size is 220 × 270 mm. The eight pairs of optical transmitter and receiver modules are mounted near the SW-LSI. The daughter board has an H8 microprocessor subboard to control the maintenance signals of the SW-LSI. A crystal oscillator is mounted to generate the 200-MHz internal clock signal. The structure and layout of the circuit board are optimized for high-speed, high-density implementation [8].

Figure 8 shows a photograph of the RHiNET-2/SW. There are four sockets for the four-by-twelve-channel fiber adapters. The motherboard is mounted on the upper layer of the cabinet. The power supply unit and the maintenance-processor card are packaged in the lower layer.

Fig. 7: Layout of the motherboard of the RHiNET-2/SW: (a) SW-LSI mounted side; (b) optical modules mounted side

*: MULTIWIRE is a trademark owned by ADVANCED INTERCONNECTION TECHNOLOGY, INC.
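The link-detection behaviour described in Section IV-E (ping until a peer answers, exchange handshake packets while the link is up, fall back to ping after a period of silence) can be pictured as a two-state machine. The Python sketch below is only an illustration of that description, not the logic implemented in the SW-LSI or the maintenance processor. The class and method names, the string packet kinds, and the timeout of three intervals are all hypothetical, since the text does not give the handshake interval or the timeout.

```python
from enum import Enum


class LinkState(Enum):
    DOWN = "down"  # no link yet: keep transmitting ping packets
    UP = "up"      # link established: transmit periodic handshake packets


class LinkMonitor:
    """Hypothetical model of one port's link detection."""

    def __init__(self, timeout_ticks=3):
        # timeout_ticks is an assumed value; the paper only says
        # "a certain period of time".
        self.state = LinkState.DOWN
        self.timeout_ticks = timeout_ticks
        self.silent_ticks = 0
        self.peer_physical_id = None

    def on_packet(self, kind, physical_id=None):
        """Handle a received packet; return the kind of reply to send, if any."""
        if kind in ("ping", "pong"):
            # Both packet kinds carry the sender's physical ID.
            self.peer_physical_id = physical_id
            self.state = LinkState.UP
            self.silent_ticks = 0
            return "pong" if kind == "ping" else None
        if kind == "handshake" and self.state is LinkState.UP:
            self.silent_ticks = 0
        return None

    def on_tick(self):
        """Advance one interval; return the kind of packet to transmit."""
        if self.state is LinkState.DOWN:
            return "ping"
        self.silent_ticks += 1
        if self.silent_ticks >= self.timeout_ticks:
            # No handshake seen for too long: treat the link as disconnected.
            self.state = LinkState.DOWN
            self.peer_physical_id = None
            return "ping"
        return "handshake"


if __name__ == "__main__":
    port = LinkMonitor()
    print(port.on_tick())                         # ping while the link is down
    print(port.on_packet("ping", physical_id=7))  # pong, link comes up
    print(port.on_tick())                         # handshake while the link is up
```

In this sketch a received ping or pong also restarts the silence counter, which is a design choice of the illustration rather than something the text specifies.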
Fig. 8: Photograph of the cabinet

VII. EVALUATION TEST RESULTS

We measured the signal eye-pattern with an oscilloscope and the bit-error rate (BER) with an error-rate detector (the measurement setup is shown in Fig. 9). The 800-Mbit/s × 12-bit electrical data signals were generated by the data generator (DG) as a clock signal (CLKI), a framing data signal (AI), and 10-bit packet data signals (DI[9..0]). These 12-bit electrical signals were converted to 12-bit optical data signals by the optical transmitter module and transmitted through the 12-channel fiber ribbon. The optical signals were input to an RX-port of RHiNET-2/SW. In RHiNET-2/SW, the 12-bit optical input signals were converted to electrical signals in the RX-port, propagated through the SW-LSI, reconverted to optical signals in the corresponding TX-port, and then transmitted from the TX-port as optical signals. The output signals were received by the optical receiver module, reconverted into electrical signals, and sent to the error-rate detector (ERD). The fiber runs were 50 m long.

Fig. 9: Experimental setup to measure the signal eye-patterns and BER of RHiNET-2/SW

Figure 10 shows the eye-pattern of the measured electrical output signal and the waveform of the clock signal. A clear eye-pattern was obtained. The signal rise time (Tr) and fall time (Tf) of the electrical output signal were both less than 400 ps. The jitter was less than 100 ps.

We evaluated the reliability of each optical port by measuring the BER. We observed no errors during 10^11-bit packet data transmission at a data rate of 880 Mbit/s × 10 bits. (This corresponds to a BER of less than 10^-11.) We used a 2^24-1 pseudo-random word sequence (PRWS) as a data pattern. These test results show that the reliability of the I/O ports in RHiNET-2/SW is sufficient for RHiNET-2 and that our high-speed and high-density circuit-board layout enables us to construct a high-performance network switch.

Fig. 10: Measured eye-pattern of an electrically re-converted 0th data bit [D0] and waveform of the clock signal [CLK] (200 mV/div; 250 ps/div; data rate: 800 Mbit/s)

To achieve highly reliable (error-free) parallel interconnection, suppressing skew is the most important improvement that must be made. This is even more important than improving the sensitivity and bandwidth. Our system requirement was that the skew be suppressed to within 20% of the clock cycle. In the case of 800 Mbit/s transmission, the
skew should be suppressed to less than 250 ps. To suppress the skew, we used high-speed LVDS electrical circuits and precisely controlled the lengthwise placement of the wires. To suppress the skew between data signals, the 800-Mbit/s × 11-bit synchronized parallel data signals were retimed with an 800-MHz clock signal using gate-latching in the TX and RX modules and at the TX- and RX-ports of the RHiNET-2/SW. The fiber length was 50 m.

We measured the skew of the 10-bit data signal at two points of the RHiNET-2/SW using the setup shown in Fig. 9. These two points were the electrical output pins of the SW-LSI and the optical output port of the optical transmitter module (Fig. 11). At the input port of the SW-LSI, the skew was eliminated by the gate-latching, but the output signal of the SW-LSI had a 141-ps skew caused by nonuniformity of the LSI output port. We eliminated this skew by using gate-latching in the optical TX module, and the skew of the optical output signal from the output port was 19.4 ps. The maximum skew of our 50-m-long 12-channel fiber ribbon was 50 ps. Thus, after 50-m-long fiber transmission, the worst-case skew is 69.4 ps (19.4 ps + 50 ps). Therefore, the skew of the data signal is sufficiently suppressed by the gate-latching, which supports high-speed and highly reliable synchronized parallel data transmission.

Fig. 11: Skew of 10-bit data based on the edge of the 0th data bit (in I/O port 4)

VIII. RELATED WORKS

Myrinet [4] is one of the most popular SANs and is widely used for cluster computing. Myrinet switches never discard any packets, and provide reasonably high link bandwidth (1.28 Gbit/s) and very low latency. However, Myrinet switches support fewer virtual channels. Therefore, the network topology is restricted so as to avoid deadlock by using carefully selected routing paths.

GSN [14] is a high-bandwidth, low-latency interconnect standard, which provides 6.4 Gbit/s link bandwidth of error-free, flow-controlled data. Although it provides four virtual channels for each link, it is difficult to support deadlock-free routing in a free topology because of the limited number of channels. A fat tree is used in the cluster using GSN.

Compaq uses the SC interconnect [15] for its inter-server connection. The SC interconnect consists of a high-bandwidth crossbar switch and a PCI adapter for each node. The detailed architecture of the SC interconnect is not disclosed. However, it also uses a fat-tree topology to keep a high degree of bisection bandwidth without deadlock.

IX. SUMMARY

We have developed the RHiNET-2/SW network switch for high-performance computing using personal computers distributed in an office or floor environment. Optical interconnection allows high-speed, highly reliable data transmission over a long distance. To achieve high-speed and low-latency node-to-node interconnection, we implemented eight pairs of 8.8-Gbit/s optical interconnection modules and a 64-Gbit/s SW-LSI in a compact circuit board. We have produced an optical interconnection module for RHiNET-2/SW that is capable of speeds of up to 8.8 Gbit/s and a one-chip CMOS ASIC switch (784-pin BGA). RHiNET-2/SW has eight input and eight output optical data ports. The bandwidth of each port is 8 Gbit/s (the aggregate throughput of the switch is 64 Gbit/s). We developed a high-speed, high-density implementation technology to overcome electrical problems such as signal propagation loss and crosstalk. All of the electrical interfaces are composed of high-speed CMOS-LVDS logic. The structure and layout of the circuit board are optimized for high-speed, high-density implementation. Our prototype system achieved 880-Mbit/s × 10-bit parallel data transmission. We
observed no errors during 10^11-bit packet data transmission at a data rate of 880 Mbit/s × 10 bits with a 50-m fiber. (This corresponds to a BER of less than 10^-11.) We have thus successfully produced a compact high-throughput optical I/O network switch using a one-chip SW-LSI and eight pairs of optical interconnection modules. This switch enables high-performance parallel computing in a distributed computing environment.

ACKNOWLEDGEMENT

We are grateful for the assistance and advice of Takahiko Takahashi and Kazuyoshi Satoh of the Device Development Center, Hitachi, Ltd., Atsushi Takai and Atsushi Miura of the Telecommunication and Information Infrastructure Systems Group, Hitachi, Ltd., T. Keicho of Hitachi ULSI Systems Co., Ltd., Y. Keikoin and K. Ohsugi of Hitachi Information Technology Co., Ltd., and M. Tanaka of Hitachi Communication Systems, Inc.

REFERENCES

[1] T. Kudoh, J. Yamamoto, F. Sudoh, H. Amano, Y. Ishikawa, and M. Sato, "Memory based light weight communication architecture for local area distributed computing", Innovative architecture for future generation high-performance processors and systems, IEEE Computer Society Press, pp. 133-139, 1997.
[2] L. M. Ni, "Should Scalable Parallel Computers Support Efficient Hardware Multicast", Proceeding of 1995 Int'l Conference on Parallel Processing Workshop on Challenges for Parallel Processing, pp. 2-7, August 1995.
[3] T. Horie, H. Ishihara, T. Shimizu, and M. Ikesaka, "AP1000 Architecture and Performance of LU Decomposition", Proceedings of 1991 Int'l Conference on Parallel Processing, pp. 634-635, August 1991.
[4] http://www.myri.com
[5] HIPPI-6400 working drafts, T11.1 maintenance drafts of ANSI NCITS
[6] IEEE802.3 Higher Speed Study Group, http://grouper.ieee.org/groups/802/3/10G_study/public/index.html
[7] H. Nishi, K. Tasho, T. Kudoh, H. Amano, "RHiNET-1/SW: One-chip switch ASIC for a local area system network", Proc. COOL Chips III, Apr. 2000, to appear.
[8] S. Nishimura, T. Kudoh, H. Nishi, K. Harasawa, N. Matsudaira, S. Akutsu, K. Tasyo, and H. Amano, "A network switch using optical interconnection for high performance parallel computing using PCs", pp. 5-12, Anchorage, U.S.A., Oct. 1999.
[9] A. Takai, T. Kato, S. Yamashita, S. Hanatani, Y. Motegi, K. Ito, H. Abe, and H. Kodera, "200-Mb/s/ch 100-m Optical Subsystem Interconnections Using 8-Channel 1.3-µm Laser Diode Arrays and Single-Mode Fiber Arrays", J. of Lightwave Technology 12, pp. 260-270, 1994.
[10] http://www.hitachi.co.jp/Prod/tcd/hikari/tjn00634.htm
[11] J. W. Goodman, F. I. Leonberger, Sun-Yuan Athale, and R. A. Kung, "Optical interconnects for VLSI system", Proceedings of the IEEE 72, pp. 159-174, July 1984.
[12] D. A. B. Miller and H. W. Ozaktas, "Limit to the Bit-rate Capacity of Electrical Interconnection from the Aspect Ratio of the System Architecture", Journal of Parallel and Distributed Computing 41, pp. 42-52, 1997.
[13] S. Nishimura, H. Inoue, H. Matsuoka, and T. Yokota, "Optical interconnection subsystem used in the RWC-1 massively parallel computer", IEEE Journal of Selected Topics on Quantum Electronics 5, pp. 360-367, 1999.
[14] http://www.llnl.gov/asci/bluemtn/
[15] http://www.compaq.com/hpc/news/news_sc_announce_p3.html