InfiniBand: An Overview
Everything changes. In the early 1990s the microprocessor was a prized possession.
By the year 2000, PCs were running microprocessors at GHz clock speeds. But the
way in which I/O was carried out remained more or less the same. The processor
is now capable of delivering data at blistering speeds, but the I/O subsystem that is
supposed to accept that data cannot keep up. The bottleneck is the
shared bus architecture.
Problems with PCI
The various components connected to a shared bus vie for control of the bus. A prime
example of this is the familiar Peripheral Component Interconnect (PCI) bus.
Fig 1. The shared PCI bus in the system architecture
Source: How PCI works? by Jeff Tyson
As shown in Fig 1, the PCI devices are all attached to a parallel PCI bus, for which
they all contend. In this kind of scenario, contention is inevitable. The
performance chart is shown below.
Fig 2. Table of PCI standards (shared parallel bus).
*Double Data Rate **Quad Data Rate
Though the maximum bandwidth shown in the table looks enormous, the bandwidth
actually at hand turns out to be about 533 MB/s for the 64-bit, 66 MHz version of PCI.
Also, due to the shared nature of the PCI bus, as the frequency of operation is increased,
the fanout has to be lowered. This means that the number of devices that can be
attached to the bus decreases. So PCI does not look like a viable option for next-generation
I/O systems, though it looks poised to survive for quite some time due to its
wide market acceptance. What could be the solution to the bus contention issue?
The use of serial switched architectures. InfiniBand is a technology that employs a
serial switched architecture.
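As a rough illustration of where that 533 MB/s figure comes from (assuming the 64-bit, 66 MHz variant of PCI), the peak bandwidth of a parallel bus is simply its width multiplied by its clock rate; the small helper below is only a back-of-the-envelope sketch:

```python
# Back-of-the-envelope PCI bandwidth estimate (illustrative only).
# Peak bandwidth of a parallel bus = bus width (bits) x clock rate (MHz) / 8.
# Nominal clocks are actually 33.33/66.66 MHz, hence the standard
# 133/266/533 MB/s figures quoted for PCI.

def pci_peak_bandwidth_mb_s(bus_width_bits, clock_mhz):
    """Theoretical peak in MB/s; real throughput is lower because every
    device shares, and must arbitrate for, the same bus."""
    return bus_width_bits * clock_mhz / 8

print(pci_peak_bandwidth_mb_s(32, 33))   # classic PCI:        ~133 MB/s
print(pci_peak_bandwidth_mb_s(32, 66))   # PCI 66 MHz:         ~266 MB/s
print(pci_peak_bandwidth_mb_s(64, 66))   # 64-bit, 66 MHz PCI: ~533 MB/s
```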
InfiniBand to the Rescue
Only a technology that is implemented very close to the processor memory bus can
be seen as a replacement for PCI. InfiniBand (IB) breaks through the bandwidth and
fanout constraints posed by the PCI bus by moving to a serial switched fabric
architecture. The question, then, is this: when established networking technologies
like Fibre Channel (FC) and Gigabit Ethernet (GigE) already provide a serial switched
architecture, what is the need for a new one?
The answer can be summarized in three key areas: storage, networking and clustering.
FC is a proven technology in the field of data storage. GigE is also
coming up in a big way; networking is the USP of GigE. But what about server
clustering? Server clustering needs a low-overhead, quick messaging service that is
very reliable. This is where InfiniBand scores. Unlike other networking technologies,
InfiniBand is designed to bypass multi-layered protocol-processing overhead. The
comparison in other areas is shown in the graphic.
Fig 3. Differences between technologies.
Source: Understanding InfiniBand by Gene Risi & Philip Bender
Components of InfiniBand
System Area Layout
Fig 4. InfiniBand topology
Fig 4 shows the InfiniBand topology in its most basic form. A node could be a
server, a PC or an I/O device such as a RAID subsystem. The fabric may be a single switch or
an interconnection of switches and routers. All connections in this topology are
switched, i.e. they are point to point, thus eliminating contention for the link. Also, due to their
serial nature, they require only four signal wires (two differential pairs) instead of the wide parallel connection of
the PCI bus.
Fig 5. A system-level view of the basic topology
In the system-level view (Fig 5) there are certain elements that need explanation.
The leftmost part of the figure depicts the internals of a node. The memory controller
is connected to a Host Channel Adapter (HCA), which is the node's entry point
into the fabric. The HCA provides the interface through which InfiniBand integrates with the
operating system. The HCA links the node with the switch, which in turn is
connected to a number of Target Channel Adapters (TCAs). The TCAs interface
target I/O devices, such as RAID and JBOD subsystems, to the InfiniBand
fabric. Each TCA serves a specific kind of target, though multi-utility TCAs are also a
possibility. These channel adapters contain ports; a single TCA or HCA can contain
more than one port. These ports connect the node to the fabric and vice versa.
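To make the relationships among nodes, channel adapters, ports and the switch concrete, here is a minimal sketch; the class and attribute names (Node-side ChannelAdapter, Port, Switch) are inventions of this sketch, not terms from the InfiniBand specification:

```python
# Minimal illustrative model of an InfiniBand system area layout.
# Class names here are made up for the sketch.

class Port:
    def __init__(self, number):
        self.number = number
        self.peer = None              # the port at the other end of the link

    def connect(self, other):
        self.peer, other.peer = other, self

class ChannelAdapter:
    """HCA (host side) or TCA (target side); may expose several ports."""
    def __init__(self, kind, num_ports=1):
        self.kind = kind              # "HCA" or "TCA"
        self.ports = [Port(i) for i in range(1, num_ports + 1)]

class Switch:
    def __init__(self, num_ports):
        self.ports = [Port(i) for i in range(1, num_ports + 1)]

# A server's HCA and a RAID subsystem's TCA, both cabled to one switch.
hca = ChannelAdapter("HCA", num_ports=2)
tca = ChannelAdapter("TCA")
switch = Switch(num_ports=8)

hca.ports[0].connect(switch.ports[0])
tca.ports[0].connect(switch.ports[1])
```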
As is evident from Fig 7, InfiniBand operates via a network protocol stack. This
protocol stack has been compared with the OSI model layers for convenience.
Fig 7. InfiniBand Protocol Stack compared with the OSI network Model
Source: InfiniBand Architecture Tutorial – Hot Chips by Daniel Cassiday
(InfiniBand Trade Association)
At the top, client layers communicate in the form of transactions. These transactions
are composed of messages that are moved through the transport layer. These
messages are then further divided into packets at the network layer, as shown in the
graphic. IB routers can route these packets across network domains, using
a global identifier called the GID for this purpose. For switching within a subnet at the data-link
layer, an identifier local to the subnet, known as the LID, is used; an IB switch
generally handles this.
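The two address forms can be pictured as follows; the field widths (128-bit GID, 16-bit LID) are per the architecture, while the class and helper names are purely illustrative:

```python
# Illustrative sketch of the two InfiniBand address forms mentioned above.
# GIDs are 128-bit global identifiers; LIDs are 16-bit, subnet-local identifiers.

from dataclasses import dataclass
from typing import Optional

@dataclass
class PacketAddress:
    dest_lid: int                 # 16-bit LID, used by switches within a subnet
    dest_gid: Optional[int]       # 128-bit GID, present only when the packet
                                  # must be routed to another subnet

    def needs_router(self) -> bool:
        """A packet carrying a GID is eligible for inter-subnet routing;
        a LID alone is enough for switching inside one subnet."""
        return self.dest_gid is not None

local = PacketAddress(dest_lid=0x0007, dest_gid=None)
remote = PacketAddress(dest_lid=0x0007, dest_gid=(0xFE80 << 112) | 0x1234)
print(local.needs_router(), remote.needs_router())   # False True
```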
Fig 6. IB PDUs at various layers.
At the lowest layers of the stack (which correspond to the physical and data-link
layers of the OSI model) the standards are more or less similar to FC. InfiniBand
uses both fibre-optic and copper cables. The IB bit error rate is 10⁻¹², and IB uses
the 8B/10B encoding standard. 8B/10B means that for every 8 bits of data to be sent,
10 bits are actually sent over the physical cabling.
Aggregating physical lanes into wider links of 4 or 12 lanes is also
supported; these are known as 4X and 12X respectively. Moreover, IB cabling is
fully duplex, i.e. a 4X channel contains 4 send and 4 receive lanes. This combination
gives higher throughput. Though there are 4 lanes, they operate as a single logical link.
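A rough calculation ties the encoding and the link widths together, assuming the original single-data-rate signaling of 2.5 Gbit/s per lane:

```python
# Rough effective-throughput calculation for 8B/10B-encoded IB links,
# assuming the original single-data-rate signaling of 2.5 Gbit/s per lane.

SIGNAL_RATE_GBPS = 2.5        # raw bit rate per physical lane (1X)
ENCODING_EFFICIENCY = 8 / 10  # 8B/10B: 10 bits on the wire carry 8 data bits

def effective_gbps(lanes):
    """Usable data rate, per direction, of a 1X/4X/12X link."""
    return lanes * SIGNAL_RATE_GBPS * ENCODING_EFFICIENCY

for width in (1, 4, 12):
    print(f"{width}X link: {effective_gbps(width):.0f} Gbit/s per direction")
# 1X link: 2 Gbit/s, 4X link: 8 Gbit/s, 12X link: 24 Gbit/s
```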
IB incorporates a concept of segmenting bandwidth using virtual lanes (VLs).
These VLs are formed by a multiplexing arrangement whereby unrelated data can flow
over the same physical link. IB supports configurations of 1, 2, 4, 8 or 15 virtual lanes; VL15 is
used only for network management and the rest are data lanes. By implementing
this, IB allows multipoint communication among nodes and provides better utilization
of the fabric.
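The toy model below gives a feel for how unrelated traffic can share one physical link over separate virtual lanes. It is heavily simplified: real IB ports use priority-weighted arbitration tables, which this sketch replaces with a simple ascending-VL scan, and all names are hypothetical.

```python
# Toy model of virtual lanes: independent packet queues multiplexed onto one
# physical link. Management traffic on VL15 is served first; the data VLs are
# scanned in ascending order in this simplified model.

from collections import deque

class VirtualLanePort:
    def __init__(self, num_data_vls=4):
        # data VLs 0..n-1, plus VL15 reserved for subnet management traffic
        self.vls = {vl: deque() for vl in range(num_data_vls)}
        self.vls[15] = deque()

    def enqueue(self, vl, packet):
        self.vls[vl].append(packet)

    def transmit_one(self):
        """Pick the next packet to put on the wire."""
        if self.vls[15]:
            return 15, self.vls[15].popleft()
        for vl, q in self.vls.items():
            if vl != 15 and q:
                return vl, q.popleft()
        return None

port = VirtualLanePort()
port.enqueue(0, "storage block")
port.enqueue(1, "cluster heartbeat")
print(port.transmit_one())   # (0, 'storage block')
print(port.transmit_one())   # (1, 'cluster heartbeat')
```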
IB provides a method to logically group together nodes that are otherwise
physically distant. This is known as partitioning. It is analogous to VLANs in
Ethernet data networks.
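In IB, membership in a partition is expressed through a 16-bit partition key (P_Key) carried in packets and checked against the keys a port is configured to accept. The snippet below is only a simplified sketch of that check (it ignores full versus limited membership, and the names are illustrative):

```python
# Illustrative partition check: nodes in the same partition share a 16-bit
# partition key (P_Key); a port ignores traffic whose P_Key it is not
# configured to accept.

STORAGE_PARTITION = 0x8001
CLUSTER_PARTITION = 0x8002

class PortPartitioning:
    def __init__(self, pkey_table):
        self.pkey_table = set(pkey_table)

    def accepts(self, packet_pkey):
        return packet_pkey in self.pkey_table

db_server = PortPartitioning({STORAGE_PARTITION, CLUSTER_PARTITION})
web_server = PortPartitioning({CLUSTER_PARTITION})

print(db_server.accepts(STORAGE_PARTITION))   # True
print(web_server.accepts(STORAGE_PARTITION))  # False: not in that partition
```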
Virtual Interface Protocol
The Virtual Interface protocol is used at the IB transport layer and is what makes IB
different. As mentioned earlier, the main area where IB scores over FC and GigE is
clustering. For the cluster heartbeat, a very low-latency network has to be present.
The main aim of the Virtual Interface (VI) protocol is to reduce the latency
between communicating servers. Using a conventional network protocol architecture for the cluster
heartbeat causes latency because of the overhead involved in executing the network
protocol code and because of the context switches needed to accept data in the
privileged mode of the OS. The privileged mode comes into the picture because the
network adapter, which receives the data, has to hand it over to the OS.
The VI protocol reduces the latency by allowing the network adapter to bypass the
OS and perform functions in non-privileged mode. VI uses certain memory-like
operations to directly access buffers on the receiver. This process is known as
Remote Direct Memory Access (RDMA). In order to bypass the privileged mode of
the OS, the various I/O and process-related management functions have to be taken
up by the VI protocol. Each application that wants to send or receive creates a Queue Pair (QP). A QP is a combination of a send and a receive queue at each port.
application that wants to communicate places a Work Queue Element (WQE)  in
the send queue. From the send queue of the sender, the data is sent to receiving
queue of the receiver. When a WQE is executed, a Completion Queue Element
(CQE) is generated and placed in a completion queue. The completion queue is
used to inform the WQE parent application of the completion and also reduces the
number of interrupts generated.
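To make the queue-pair mechanism concrete, here is a minimal, illustrative sketch; the class names and the toy process_send step are inventions of this sketch, not the real verbs interface. Data posted as a WQE on one side lands in a buffer posted on the other, and both sides learn of the outcome through CQEs that they poll for, rather than through per-operation interrupts.

```python
# Simplified model of the queue-pair mechanism: work queue elements (WQEs)
# are posted to a send or receive queue, and each completed WQE produces a
# completion queue element (CQE) in a completion queue.

from collections import deque
from dataclasses import dataclass

@dataclass
class WQE:
    opcode: str        # e.g. "SEND", "RDMA_WRITE", "RDMA_READ"
    buffer: bytearray

@dataclass
class CQE:
    opcode: str
    status: str

class QueuePair:
    def __init__(self, completion_queue):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.cq = completion_queue

    def post_send(self, wqe):
        self.send_queue.append(wqe)

    def post_recv(self, wqe):
        self.recv_queue.append(wqe)

    def process_send(self, remote):
        """Pretend to be the channel adapter: execute one send WQE, deliver
        the data into the peer's posted receive buffer, complete both sides."""
        wqe = self.send_queue.popleft()
        peer_wqe = remote.recv_queue.popleft()
        peer_wqe.buffer[:len(wqe.buffer)] = wqe.buffer
        self.cq.append(CQE(wqe.opcode, "success"))
        remote.cq.append(CQE("RECV", "success"))

# One QP per communicating endpoint, each with its own completion queue.
cq_a, cq_b = deque(), deque()
qp_a, qp_b = QueuePair(cq_a), QueuePair(cq_b)

qp_b.post_recv(WQE("RECV", bytearray(16)))
qp_a.post_send(WQE("SEND", bytearray(b"heartbeat")))
qp_a.process_send(qp_b)
print(cq_a.popleft(), cq_b.popleft())
```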
There are certain functions defined for both the send and the receive queues. The
send queue can perform basic message sends, plus three RDMA-related operations
known as RDMA-Read, RDMA-Write and RDMA-Atomic.
For the receive queue, the only type of operation is Post Receive Buffer, which identifies
a buffer into which a client may send data, or from which data may be read, through a Send, RDMA-Write or RDMA-Read operation.
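In the same illustrative spirit (all names here are hypothetical), the practical difference between a Send and an RDMA-Write can be pictured as follows: a Send consumes one of the receiver's posted buffers, while an RDMA-Write places data directly into a buffer the target registered in advance, without involving the remote CPU on the data path.

```python
# Standalone sketch: RDMA-Write lands directly in a pre-registered buffer,
# while a plain Send consumes a posted receive buffer on the target.

registered_buffer = bytearray(32)          # target registered this in advance
posted_receive_buffers = [bytearray(16)]   # consumed only by Send operations

def rdma_write(data, offset=0):
    # Direct, memory-like placement: no receive buffer is consumed.
    registered_buffer[offset:offset + len(data)] = data

def send(data):
    # A Send consumes the next posted receive buffer on the target.
    buf = posted_receive_buffers.pop(0)
    buf[:len(data)] = data
    return buf

rdma_write(b"block 42")
print(bytes(registered_buffer[:8]))    # b'block 42'
print(bytes(send(b"hello")[:5]))       # b'hello'
```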
Fig 8. VI protocol communication mechanism
Source: An introduction to InfiniBand Architecture by Odysseas Pentakalos
Types of services:
IB provides five different types of transport service (a quick classification of these follows the list):
• Reliable Connection
• Unreliable Connection
• Reliable Datagram
• Unreliable Datagram
• Raw Datagram
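The sketch below summarizes how these five services are usually characterized along two axes: whether a dedicated QP-to-QP connection is required, and whether delivery is acknowledged. Raw datagrams additionally bypass the normal IB transport headers.

```python
# Quick reference for the five transport services, keyed by whether a
# dedicated connection is required and whether delivery is acknowledged.

TRANSPORT_SERVICES = {
    "Reliable Connection":   {"connected": True,  "acknowledged": True},
    "Unreliable Connection": {"connected": True,  "acknowledged": False},
    "Reliable Datagram":     {"connected": False, "acknowledged": True},
    "Unreliable Datagram":   {"connected": False, "acknowledged": False},
    "Raw Datagram":          {"connected": False, "acknowledged": False},
}

for name, props in TRANSPORT_SERVICES.items():
    print(f"{name:22} connected={props['connected']} acked={props['acknowledged']}")
```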
Scope as a PCI replacement
IB came into the market and was immediately touted as the PCI replacement.
But any technology takes a while to become popular in the market. PCI is an
established technology, and a lot of IT professionals are at ease with it. In this
scenario, the chances of IB displacing PCI outright seem very slim. IB is making inroads into
the market not as a competitor to PCI but as a complementary technology. In fact,
adapters are already on the market that provide support for both IB and PCI-X.
A comparative chart is shown in the figure:
Fig 8. Table showing comparison between PCI and IB
Source: Introduction to the value proposition of InfiniBand by Marc Staimer (Dragon Slayer Consulting)
The response to IB has been positive. According to analysts, a large percentage
of servers will soon be IB-enabled. This growth will take place when IB becomes native
to the server motherboard. It is predicted that the use of IB as a technology
for clustering, storage as well as networking will soon ensue.
The predictions may be positive, but the IT world is such that what is hot property
today may be obsolete tomorrow. So what lies in store for InfiniBand is for time to tell.
Glossary
1. AGP – Accelerated Graphics Port
2. BW - Bandwidth
3. CPU – Central Processing Unit
4. CQE – Completion Queue Element
5. DDR – Double Data Rate
6. FC – Fibre Channel
7. GID – Global Identifier
8. GigE – Gigabit Ethernet
9. HCA – Host Channel Adapter
10. IB – InfiniBand
11. IBTA – InfiniBand Trade Association
12. ISA – Industry Standard Architecture
13. LID – Local Identifier
14. PCI - Peripheral Component Interconnect
15. QDR – Quadruple Data Rate
16. QP – Queue Pair
17. RAM – Random Access Memory
18. RDMA – Remote Direct Memory Access
19. SNIA – Storage Networking Industry Association
20. TCA – Target Channel Adapter
21. VI – Virtual Interface
22. VL – Virtual Lanes
23. WQE – Work Queue Element
References
1. InfiniBand Architecture Tutorial – Hot Chips by Daniel Cassiday (InfiniBand Trade Association)
2. Introduction to the value proposition of InfiniBand by Marc Staimer
(Dragon Slayer Consulting)
3. An introduction to InfiniBand Architecture by Odysseas Pentakalos
4. How PCI Works by Jeff Tyson
5. Understanding InfiniBand by Gene Risi & Philip Bender
6. Building Storage Networks – 2nd Edition by Marc Farley (Storage Networking Industry Association)