High-speed network switch RHiNET-2/SW
and its implementation with optical interconnections
S. Nishimura (1) , T. Kudoh(2) , H. Nishi(2) , J. Yamamoto(2) , K. Harasawa(3) ,
N. Matsudaira(3) , S. Akutsu(3) , K. Tasyo(4) , and H. Amano (5)
RWCP Optical Interconnection Hitachi Laboratory.
(c/o Central Research Laboratory, Hitachi, Ltd.)
1-280 Higashi-Koigakubo, Kokubunji, Tokyo 185-8601, JAPAN
Real World Computing Partnership Tsukuba Research Center
Hitachi Communication Systems, Inc.
Dept. of Information and Computer Science, Keio University
Abstract--RHiNET-2/SW is a network switch that enables high-performance parallel computing in a distributed environment. We have produced a prototype network-switch board (RHiNET-2/SW) for the RHiNET-2 parallel-computing system. Eight pairs of 800-Mbit/s 12-channel optical interconnection modules and a one-chip CMOS ASIC switch LSI (a 784-pin BGA package) are mounted on a single board. This switch allows high-speed 8-Gbit/s/port parallel optical data transmission over a distance of up to 100 m, and the aggregate throughput is 64 Gbit/s/board. By using a large amount of embedded memory on the switching LSI, RHiNET-2/SW allows low-latency, free-topology network performance. We evaluated the reliability of each optical port by measuring BER: no errors were detected during 10^11-bit packet data transmission at a data rate of 880 Mbit/s × 10 bits (BER: < 10^-11). This test result shows that the RHiNET-2/SW can provide high-throughput, long-transmission-length, and highly reliable data transmission.

I. INTRODUCTION

Network-based parallel processing using commodity components, such as personal computers, has received attention as an important parallel-computing environment [1-3]. In most of today's offices and laboratories, there are tens of personal computers and workstations, which are not always in use. If the processing power of such idle computers could be combined, the resulting processing power might be comparable to that of a supercomputer.

However, most high performance cluster systems consisting of personal computers or workstations use a system area network (or a server area network: SAN) such as Myrinet as their interconnection. SAN provides low latency high bandwidth communication without discarding any packet. It also provides high bisection bandwidth, which is required for high performance parallel computing. Since they are designed to connect dedicated computers in a small place, both the link length and topology are restricted. On the other hand, high-speed LANs with more than 1 Gbit/s link bandwidth are becoming available [5-6]. LANs provide relatively flexible topology choices and longer length of links. Nevertheless, the communication latency of most of today's commodity LANs tends to be larger than that of SANs because of their store-and-forward routing strategy. Moreover, today's LANs support the IP protocol, consisting of a lot of layers, which introduce overhead.

LASN (Local Area System Network) is a new class of networks which has the advantages of both SANs and LANs. As shown in Fig. 1, an LASN assumes an environment in which personal computers and workstations are distributed within one or more floors of a building (i.e. a LAN environment).
RHiNET (RWCP High performance network) is the first network designed with the concepts of an LASN in mind.

Fig. 1: Schematic structure of RHiNET

II. CONCEPT OF THE RHINET

We have developed the first prototype, called RHiNET-1, using 1.33-Gbit/s optical interconnections and the second prototype, called RHiNET-2, using 8.8-Gbit/s optical interconnections. In RHiNET-2, PCs are interconnected via high-throughput 8×8 network switches (the RHiNET-2/SW). RHiNET-2/SW has eight optical input and eight optical output ports (Fig. 2). Each port has an 8-Gbit/s transmission capacity, and the aggregate throughput is 64 Gbit/s. A 12-bit synchronized parallel optical signal is converted to a 12-bit electrical signal in the optical receiver, switched by the SW-LSI, and re-converted to a 12-bit parallel optical signal in the optical transmitter. The transmission length is limited to 100 m by the skew of the fiber ribbon (however, when copper cables are used for electrical interconnection, the transmission length is limited to about 10 m by electrical circuit drivability). To meet the requirements of RHiNET, RHiNET-2/SW provides large internal memory blocks and supports topology-free, reliable, low-latency and high-bandwidth communication. To achieve such high-speed optical transmission in RHiNET, 8.8-Gbit/s (800-Mbit/s×11-bit) synchronized parallel optical interconnection is used for each data link. Synchronized parallel optical interconnection allows high-speed, long-transmission-length, low-latency node-to-node interconnection [9-13].

III. OPTICAL INTERCONNECTION MODULE

We use synchronized parallel 12-channel optical transmitter and receiver modules in RHiNET-2/SW [9, 10] (Fig. 3). The optical transmitter modules consist of a 1.3-µm edge-emitting laser-diode (LD) array and a single-mode-fiber (SMF) array. The channel configuration is made up of 11 low-voltage-differential-signaling (LVDS)/positive-level emitter-coupled-logic (P-ECL) non-return-to-zero (NRZ) data signals and one LVDS/P-ECL clock signal. The P-ECL output signals from the receiver module are converted to LVDS signals with a level converter. The input clock signal is used to latch the 11 data signals in the transmitter and receiver modules in order to eliminate skew caused by the logic LSIs. The transmission length is up to 100 m and the total throughput is up to 8.8 Gbit/s/module (800-Mbit/s×11-bit data and 1-bit clock). To achieve high-density implementation with high-speed signaling devices, we have to overcome many complex problems, such as crosstalk, skew and propagation loss.
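The link-rate figures quoted in Sections II and III follow from per-channel arithmetic. The sketch below reproduces them. The function name `link_rates` and its parameter names are illustrative only and do not come from the original design; the sketch reads the 11 "data" channels of the module as 10 packet-data channels plus 1 framing channel, which is how the port is described in Section IV.

```python
def link_rates(channel_mbps=800, data_bits=10, framing_bits=1, ports=8):
    """Return (module, port payload, aggregate) rates in Mbit/s.

    The clock channel carries no data, so it is excluded from every total.
    All names here are hypothetical; only the numbers come from the paper.
    """
    module = channel_mbps * (data_bits + framing_bits)  # 11 signal channels
    port_payload = channel_mbps * data_bits             # 10 packet-data channels
    aggregate = port_payload * ports                    # all ports of one switch
    return module, port_payload, aggregate


module, port_payload, aggregate = link_rates()
print(f"module: {module / 1000} Gbit/s")
print(f"port payload: {port_payload / 1000} Gbit/s")
print(f"aggregate: {aggregate / 1000} Gbit/s")
```

With the defaults the three values are 8800, 8000 and 64000 Mbit/s, matching the 8.8-Gbit/s module rate, the 8-Gbit/s port capacity and the 64-Gbit/s aggregate throughput stated in the text.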
Fig. 2: Schematic structure of RHiNET-2/SW (labels in the figure: 12-channel optical receiver; 12-channel optical transmitter; optical input and optical output, 800 Mbit/s × 10 bits/port with clock & framing; printed circuit board, (differential) LVDS, 800 Mbit/s × 10 bits/port with clock & framing)
Fig. 3: 12-channel parallel optical
interconnection modules [9, 10]
Fig. 4: Block diagram of the switch core in the
SW-LSI for RHiNET-2/SW
IV. SWITCHING LSI

We developed a 64-Gbit/s/chip high-throughput CMOS switching LSI for the RHiNET-2/SW (Figs. 2, 4 and 5). This switching LSI has eight input and eight output ports. Both input and output ports consist of 10-bit data signals, a clock signal, and a framing signal of 800 Mbit/s. The core switch logic operates at a clock rate of 100 MHz. Therefore, 1:8 demultiplexers are provided at the input ports and 8:1 multiplexers are provided at the output ports. The 8×10-bit incoming data are transformed to 80-bit data by the demultiplexer. ECC decoders and encoders are provided at the input and output ports respectively. The ECC decoder decodes the 80-bit data into 66-bit data, which are handled by the core logic. Input signals synchronized with the transmission clock are retimed to be in phase with the baseclock (200 MHz) in the elastic buffer. Since source-synchronous clocking is used, an elastic buffer is provided at each input port to compensate for the difference between the transmission clock and the baseclock of up to 100 ppm.

All electrical I/O interfaces are 2.5-V LVDS-CMOS devices. To achieve high-speed I/O, rise and fall times (< 0.3 ns) and signal skew (< 0.3 ns) must be very small. We used 0.18-µm technology to fabricate the LSI. The LSI package is a 784-pin ball grid array (BGA), the pin pitch of the package is 1.27 mm, and the package size is 39×39 mm. There are 384 high-speed signal pins (12 bits/port × 8 ports × 2 pins [differential] for input and output; data rate: 800 Mbit/s/pin). We customized the assignment of the LSI pins to achieve high-speed, low-crosstalk data I/O with a compact, high-density circuit board.

Fig. 5: Floor plan of the SW-LSI

B. Switching functions

RHiNET-2/SW has the following features (Figs. 4 and 5):

1) Asynchronous wormhole routing
Store-and-forward routing, which is commonly used in conventional LAN switches/routers, yields a large latency. Wormhole routing achieves low-latency switching, since a switch can start transmitting the first part of a packet, if possible, even while receiving the latter part of the same packet. However, the performance of pure wormhole routing is severely degraded when a message is multicast in a loaded network. To cope with this problem, asynchronous wormhole
routing (which provides a certain size of packet buffer) is adopted.
2) No packet discarding
The switch never discards packets even when the network is severely congested.
3) In-order delivery
The network ensures in-order delivery of packets.
4) Free topology design while avoiding deadlock
The switch avoids deadlock by providing a number of VCs (virtual channels) at each input port. By using a different VC as a packet travels through the switches, no cyclic dependency is generated. RHiNET-2/SW has 16 VCs at each input port. This means the diameter of the network can be up to 16. Since each switch has eight ports, this number of VCs provides virtually free topology of the network.
5) Supports up to 100-m links
RHiNET-2/SW supports optical links up to 100 m long. Since an optical signal propagates in the fiber with a delay of about 5 ns/m, the round-trip delay of a 100-m-long optical link is about 1 µs. The handshake logic of the switch also yields some delays. Therefore, the handshake will produce up to 1.5-µs delay. When the data rate is 8 Gbit/s, a 1.5-µs delay corresponds to the time to transfer 1.5 kbytes. Therefore, to receive data without discarding anything, the receiver side should send a handshake message to stop transmission when it does not have enough usable memory space (less than 1.5 kbytes + maximum packet size) in the input buffer. Such a flow-control mechanism is called the slack buffer. In RHiNET-2/SW, each of the 16 VCs of an input port provides a 4-kbyte slack-buffer mechanism. Multiple slack buffers are therefore provided for each input port.
6) Multiple-bit-rate support
Each port can be set to a bit rate of 8 Gbit/s, 2 Gbit/s or 1 Gbit/s. The slower bit rates are provided to support slower network interfaces. The maintenance processor sets the bit rate.

Figure 6 shows the packet format of RHiNET-2/SW handled in core logic (64-bit data). RHiNET-2/SW supports variably sized packets. A data packet contains a maximum data size of 2 kbytes. The hop counter is incremented when a packet goes through a switch and is used to detect an irregularly routed packet caused by a wrong routing table or damaged header. A handshake packet includes programmable “almost full flags” of all VCs. A ping/pong packet reports its own logical ID, physical ID, and port ID. Command packets are used to exchange information between maintenance processors of adjacent nodes. They are immediately forwarded to the maintenance processor when received. The payload of the command packet is the message to the maintenance processor.

Fig. 6: RHiNET-2/SW packet format

D. Routing

Routing is done according to the routing information statically stored in the routing table of each switch. Each switch has a routing table with 65,536 entries. An entry is a 9-bit full bitmap of the outputs (8 bits correspond to the output ports, and one bit corresponds to the maintenance processor), and setting multiple bits of an entry provides multicasting. The routing ID in the header of a data packet is used as the entry ID of the routing table. The maintenance processor sets the entries of the routing table. For example, if destination routing is used and there is no multicasting, an entry of the routing table can correspond to a node; thus, a maximum of 65,536 nodes can be supported.

E. Maintenance processor and hot-plug support

An on-chip maintenance module and an off-chip maintenance processor are provided to configure routing tables and support dynamic link detection. While a link has not been established, RHiNET-2/SW continuously transmits ping packets. When a switch receives a ping packet, it replies with a pong
packet and then the link between the two switches is established. The ping-and-pong packet includes the sender's physical ID (Fig. 6). By receiving a ping or a pong packet, the maintenance processor obtains the physical ID of the switch at the other end of the link. Then, the maintenance processors of the switches exchange the necessary information to set the routing table. RHiNET-2/SW transmits handshaking packets at regular intervals while a link is established. It then detects the link disconnection if it receives no handshake packet for a certain period of time. In such a case, RHiNET-2/SW starts to transmit ping packets again.

V. HIGH-DENSITY IMPLEMENTATION OF HIGH-SPEED SIGNALS

In RHiNET-2/SW, to realize high-speed, high-density integration with the optical interconnection module and SW-LSI, we employ a MULTIWIRE™* interconnect board (MWB™) as a printed circuit board. The MWB™ can achieve high wiring density and superior electrical characteristics (low loss, high-accuracy 50-ohm impedance, and low reflection). The MWB™ uses copper wires (0.1 mm diameter) that are coated with polyimide insulation and can therefore be cross wired. This also accounts for the high wiring density (a 0.3 mm wire pitch can be achieved). Since constant-diameter wires are incorporated in the MWB™s, controlled characteristic impedance (Z0: 50 Ω) can be easily realized. Furthermore, by utilizing very thin wires with adequate spacing, crosstalk and bending loss can be reduced.

We measured the physical characteristics of the MWB™ (propagation loss and crosstalk). In 150-mm-long straight wires, the -3-dB-down bandwidth was greater than 2.4 GHz, and the crosstalk on the receiver side was less than 1.2% at 900 MHz and a wire pitch of 0.5 mm. We then optimized the layout of the circuit board based on the experimental results to realize low-crosstalk, high-speed, and high-density electrical I/O.

We have produced a prototype of the RHiNET-2/SW eight-by-eight network switch (Fig. 7). In the center of the board, the SW-LSI is mounted in an LSI socket. This socket was designed specifically for the high-speed LSI (bandwidth: DC to 6 GHz; path inductance: < 1 nH; and capacitance [signal-to-signal]: < 1 pF). Each port has 800-Mbit/s×12-bit optical I/O channels and uses one pair of the 12-channel parallel optical interconnection modules. The board size is 220×270 mm. The eight pairs of optical transmitter and receiver modules are mounted near the SW-LSI. The daughter board has an H8 microprocessor subboard to control the maintenance signals of the SW-LSI. A crystal oscillator is mounted to generate the 200-MHz internal clock signal. The structure and layout of the circuit board are optimized for high-speed, high-density implementation.

Figure 8 shows a photograph of the RHiNET-2/SW. There are four sockets of the four-by-twelve-channel fiber adapters. The motherboard is mounted on the upper layer of the cabinet. The power supply unit and maintenance processor card are packaged in the lower layer.

Fig. 7: Layout of the motherboard of the RHiNET-2/SW: (a) SW-LSI mounted side; (b) optical modules mounted side

*: MULTIWIRE is a trademark owned by ADVANCED INTERCONNECTION TECHNOLOGY, INC.
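The slack-buffer sizing argument given for 100-m links in Section IV (feature 5 of the switching functions) can be checked numerically. The following is a sketch under stated assumptions, not the switch's actual logic: all function and parameter names are hypothetical, the 500-ns handshake-logic delay is inferred as the difference between the quoted 1.5-µs total and the 1-µs round trip, and "2 kbytes" and "4 kbytes" are read as 2048 and 4096 bytes.

```python
def in_flight_bytes(link_m=100, fiber_ns_per_m=5, logic_ns=500, rate_gbps=8):
    """Bytes that can still arrive after the receiver sends a stop handshake."""
    round_trip_ns = 2 * link_m * fiber_ns_per_m  # stop message out, last data back
    total_ns = round_trip_ns + logic_ns          # plus assumed handshake-logic delay
    bits = rate_gbps * total_ns                  # Gbit/s x ns = bits
    return bits // 8


def stop_threshold_bytes(max_packet_bytes=2048, **link):
    """Free space below which the receiver must ask the sender to stop."""
    return in_flight_bytes(**link) + max_packet_bytes


def must_send_stop(free_bytes, **kwargs):
    """True when the free space in one VC buffer has dropped below the threshold."""
    return free_bytes < stop_threshold_bytes(**kwargs)


print(in_flight_bytes(), stop_threshold_bytes())
```

With the defaults, 1500 bytes can be in flight and the stop threshold is 3548 bytes, which fits within the 4-kbyte slack buffer that the text assigns to each VC.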
Fig. 8: Photograph of the cabinet

VII. EVALUATION TEST RESULTS

We measured the signal eye-pattern with an oscilloscope and the bit-error rate (BER) with an error-rate detector (the measurement setup is shown in Fig. 9). The 800-Mbit/s×12-bit electrical data signals were generated by the data generator (DG) as a clock signal (CLKI), a framing data signal (AI), and 10-bit packet data signals (DI[9..0]). These 12-bit electrical signals were converted to 12-bit optical signals by the optical transmitter module and transmitted through the 12-channel fiber ribbon. The optical signals were input to an RX-port of RHiNET-2/SW. In RHiNET-2/SW, the 12-bit optical input signals were converted to electrical signals in the RX-port, propagated through the SW-LSI, and reconverted to optical signals in the corresponding TX-port, then transmitted from the TX-port as optical signals. The output signals were received by the optical receiver module, reconverted into electrical signals, and sent to the error-rate detector (ERD). The fiber runs were 50 m long.

Fig. 9: Experimental setup to measure the signal eye-patterns and BER of RHiNET-2/SW.

Figure 10 shows the eye-pattern of the measured electrical output signal and the waveform of the clock signal. A clear eye-pattern was obtained. The signal rise time (Tr) and fall time (Tf) of the electrical output signal were both less than 400 ps. The jitter was less than 100 ps. We evaluated the reliability of each optical port by measuring the BER. We observed no errors during 10^11-bit packet data transmission at a data rate of 880 Mbit/s×10 bits. (This corresponds to a BER of less than 10^-11.) We used a 2^24-1 pseudo-random word sequence (PRWS) as a data pattern. These test results show that the reliability of the I/O ports in RHiNET-2/SW is sufficient for RHiNET-2 and that our high-speed and high-density circuit-board layout enables us to construct a high-performance network switch.

Fig. 10: Measured eye-pattern of an electrically re-converted 0th data bit [D0] and waveform of the clock signal [CLK] (200 mV/div; 250 ps/div; data rate: 800 Mbit/s).

To achieve highly reliable (error-free) parallel interconnection, suppressing skew is the most important improvement that must be made. This is even more important than improving the sensitivity and bandwidth. Our system requirement was that the skew be suppressed to within 20% of the clock cycle. In the case of 800 Mbit/s transmission, the
skew should be suppressed to less than 250 ps. To suppress the skew, we used high-speed LVDS electrical circuits, and precisely controlled the lengthwise placement of the wires. To suppress the skew between data signals, the 800-Mbit/s × 11-bit synchronized parallel data signals were retimed with an 800-MHz clock signal using gate-latching in the TX and RX modules and at the TX- and RX-ports of the RHiNET-2/SW. The fiber length was 50 m.

We measured the skew of the 10-bit data signal at two points of the RHiNET-2/SW using the setup shown in Fig. 9. These two points were the electrical output pins of the SW-LSI and the optical output port of the optical transmitter module (Fig. 11). In the input port of the SW-LSI, the skew was eliminated by the gate-latching, but the output signal of the SW-LSI had a 141-ps skew caused by nonuniformity of the LSI output port. We eliminated this skew by using gate-latching in the optical TX module, and the skew of the optical output signal from the output port was 19.4 ps. The maximum skew of our 50-m-long 12-channel fiber ribbon was 50 ps. Thus, after 50-m-long fiber transmission, the worst-case skew is 69.4 ps. Therefore, the skew of the data signal is sufficiently suppressed by the gate-latching, which thus supports high-speed and highly reliable synchronized parallel data transmission.

VIII. RELATED WORKS

Myrinet is one of the most popular SANs widely used for cluster computing. Myrinet switches never discard any packets, and provide reasonably high link bandwidth (1.28 Gbit/s) and very low latency. However, Myrinet switches support fewer virtual channels. Therefore, the network topology is restricted so as to avoid deadlock by using carefully selected routing paths.

GSN is a high-bandwidth, low-latency interconnect standard, which provides 6.4 Gbit/s link bandwidth of error-free, flow-controlled data. Although it provides four virtual channels for each link, it is difficult to support deadlock-free routing in a free topology because of the channel number limitation. A fat tree is used in the cluster using GSN.

Compaq uses the SC interconnect for its inter-server connection. The SC interconnect consists of a high-bandwidth crossbar switch and a PCI adapter for each node. The detailed architecture of the SC interconnect is not disclosed. However, it also uses a fat tree topology to keep a high degree of bisection bandwidth.
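The skew budget discussed in Section VII can be tallied in a few lines. This is a minimal sketch with hypothetical function names; the numbers are the ones quoted in the text (800 Mbit/s per channel, a 20% budget, 19.4 ps of residual skew at the optical output, and 50 ps for the 50-m fiber ribbon).

```python
def skew_budget_ps(rate_mbps=800, fraction=0.20):
    """Allowed skew: a fixed fraction of one bit period, in picoseconds."""
    bit_period_ps = 1e6 / rate_mbps  # 800 Mbit/s gives a 1250-ps bit period
    return fraction * bit_period_ps


def worst_case_skew_ps(tx_output_ps=19.4, fiber_ps=50.0):
    """Residual transmitter skew plus the maximum skew of the fiber ribbon."""
    return tx_output_ps + fiber_ps


budget = skew_budget_ps()
worst = worst_case_skew_ps()
print(f"budget {budget:.1f} ps, worst case {worst:.1f} ps, margin {budget - worst:.1f} ps")
```

The 69.4-ps worst case sits well inside the 250-ps budget, leaving a margin of about 180 ps under these figures.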
Fig. 11: Skew of 10-bit data based on the edge of the 0th data bit (in I/O port 4).

We have developed the RHiNET-2/SW network for high-performance computing using personal computers distributed in an office or floor environment. Optical interconnection allows high-speed, highly reliable data transmission over a long distance. To achieve high-speed and low-latency node-to-node interconnection, we implemented eight pairs of 8.8-Gbit/s optical interconnection modules and a 64-Gbit/s SW-LSI in a compact circuit board. We have produced an optical interconnection module for RHiNET-2/SW that is capable of speeds of up to 8.8 Gbit/s and a one-chip CMOS ASIC switch (784-pin BGA). RHiNET-2/SW has eight input and eight output optical data ports. The bandwidth of each port is 8 Gbit/s (aggregate throughput of the switch is 64 Gbit/s). We developed a high-speed, high-density implementation technology to overcome electrical problems such as signal propagation loss and crosstalk. All of the electrical interfaces are composed of high-speed CMOS-LVDS logic. The structure and layout of the circuit board are optimized for high-speed, high-density implementation. Our prototype system achieved 880-Mbit/s×10-bit parallel data transmission. We
observed no errors during 10^11-bit packet data transmission at a data rate of 880 Mbit/s×10 bits with a 50-m fiber. (This corresponds to a BER of less than 10^-11.) We have thus successfully produced a compact high-throughput optical I/O network switch using a one-chip SW-LSI and eight pairs of optical interconnection modules. This switch enables high-performance parallel computing in a distributed computing environment.

ACKNOWLEDGEMENT

We are grateful for the assistance and advice of Takahiko Takahashi and Kazuyoshi Satoh of the Device Development Center, Hitachi, Ltd., Atsushi Takai and Atsushi Miura of the Telecommunication and Information Infrastructure Systems Group, Hitachi, Ltd., T. Keicho of Hitachi ULSI Systems Co., Ltd., Y. Keikoin and K. Ohsugi of Hitachi Information Technology Co., Ltd., and M. Tanaka of Hitachi Communication Systems, Inc.

REFERENCES

T. Kudoh, J. Yamamoto, F. Sudoh, H. Amano, Y. Ishikawa, and M. Sato, "Memory based light weight communication architecture for local area distributed computing", Innovative architecture for future generation high-performance processors and systems, IEEE Computer Society Press, pp. 133-139, 1997.

L. M. Ni, "Should Scalable Parallel Computers Support Efficient Hardware Multicast", Proceedings of 1995 Int'l Conference on Parallel Processing Workshop on Challenges for Parallel Processing, pp. 2-7, August 1995.

T. Horie, H. Ishihara, T. Shimizu, and M. Ikesaka, "AP1000 Architecture and Performance of LU Decomposition", Proceedings of 1991 Int'l Conference on Parallel Processing, pp. 634-635, August 1991.

HIPPI-6400 working drafts, T11.1 maintenance drafts of ANSI NCITS

IEEE802.3 Higher Speed Study Group

H. Nishi, K. Tasho, T. Kudoh, and H. Amano, "RHiNET-1/SW: One-chip switch ASIC for a local area system network", Proc. COOL Chips III, Apr. 2000, to appear.

S. Nishimura, T. Kudoh, H. Nishi, K. Harasawa, N. Matsudaira, S. Akutsu, K. Tasyo, and H. Amano, "A network switch using optical interconnection for high performance parallel computing using PCs", pp. 5-12, Anchorage U.S.A., Oct. 1999.

A. Takai, T. Kato, S. Yamashita, S. Hanatani, Y. Motegi, K. Ito, H. Abe, and H. Kodera, "200-Mb/s/ch 100-m Optical Subsystem Interconnections Using 8-Channel 1.3-µm Laser Diode Arrays and Single-Mode Fiber Arrays", J. of Lightwave Technology 12, pp. 260-270, 1994.

4.htm

J. W. Goodman, F. J. Leonberger, Sun-Yuan Kung, and R. A. Athale, "Optical interconnects for VLSI system", Proceedings of the IEEE 72, pp. 159-174, July 1984.

D. A. B. Miller and H. W. Ozaktas, "Limit to the Bit-rate Capacity of Electrical Interconnection from the Aspect Ratio of the System Architecture", Journal of Parallel and Distributed Computing 41, pp. 42-52, 1997.

S. Nishimura, H. Inoue, H. Matsuoka, and T. Yokota, "Optical interconnection subsystem used in the RWC-1 massively parallel computer", IEEE Journal of Selected Topics on Quantum Electronics 5, pp. 360-367, 1999.

http://www.llnl.gov/asci/bluemtn/

http://www.compaq.com/hpc/news/news_sc_announce_p3.html