Why 10 Gigabit Ethernet? Why Now?
With the advent of Ultra Low-Latency switches such as the Nexus 5000, which provides a consistent sub-3 uSec latency regardless of load or packet size, End-to-End latency is becoming more important. From an End-to-End latency perspective, roughly 90% of latency is In-Host rather than In-Network. In addition to faster switches and decreased serialization delay, 10GE NIC technology allows for lower CPU Utilization and reduced In-Host latency.
Figure 1: Cisco Nexus 5000 Series 10 Gigabit Ethernet Switches
Nexus 5000 Data Sheet: http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/data_sheet_c78-461802.html
What About Infiniband?
Infiniband (IB) came about with the promise of Ultra Low Latency and Low CPU Utilization, but it brought a new set of problems with it. Ethernet has become the ubiquitous standard in the industry. We are even starting to see conventional High Performance Computing vendors such as Myricom and Voltaire, and SAN vendors such as Brocade, develop Ethernet products as Network, Storage, and HPC environments converge onto a single Unified 10GE Fabric. When communicating outside of your LAN, to an exchange for example, traffic must pass through an Infiniband gateway to translate IB to Ethernet, making any theoretical latency gain negligible in real-world scenarios. To take advantage of the latencies Infiniband promised, applications had to be re-written to use RDMA, a sacrifice many were not willing to make. IB also lacks features that are standard in the Ethernet world, such as ACLs and QoS, and conventional network monitoring tools and sniffers do not work with IB.
What is RDMA?
RDMA, or Remote Direct Memory Access, is a technology that allows a sender to write directly to a receiver’s memory, bypassing the kernel. With conventional NICs, packets entering a NIC are processed by the server’s CPU using the Operating System’s UDP/IP Stack. This process requires multiple interrupts, context switches, and copies of the data before it ends up in application memory, available for use.
Figure 2: Conventional Server I/O
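As a point of reference, the conventional path is exactly what ordinary sockets code exercises: the kernel's UDP/IP stack handles the interrupts and protocol processing, then copies each datagram from kernel space into the application's buffer. The sketch below is a minimal, generic UDP receiver (the port number and buffer size are illustrative assumptions), not anything vendor-specific.

```c
/* Conventional UDP receive: the kernel's UDP/IP stack handles interrupts,
 * protocol processing, and a copy from kernel memory into 'buf'. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);           /* kernel-managed socket */

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);                        /* arbitrary example port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    /* recvfrom() blocks until the kernel has processed a datagram, then
     * copies the payload from kernel space into the application buffer. */
    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
    printf("received %zd bytes via the kernel UDP/IP stack\n", n);
    close(fd);
    return 0;
}
```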
In an RDMA environment, this process is much simpler; it is commonly referred to as Kernel Bypass/Zero Copy. With RDMA, the packet is processed by the NIC and is then copied directly to application memory without requiring processing by the CPU. The result is reduced In-Host Latency and lower CPU Utilization.
Figure 3: Kernel Bypass/Zero Copy Server I/O
What Cables Can I Use?
10GbaseT
10GbaseT will allow for 10GE speeds over Cat 6a cabling. This is the eventual low-cost solution for 10G/1G/100Mbps communication; however, the technology is in its infancy. Today’s 10GbaseT PHYs consume ~8W and induce 2.5 uSec of latency per port. Both figures will eventually be reduced and the technology will be incorporated into future Cisco products, but it is not supported today.
TwinAx (CX1)
The current low-cost, low-power solution for 10GE is TwinAx cabling. This consists of a copper (CX1) cable with an SFP+ transceiver directly attached to each end.
Note: Each transceiver induces an additional 50 nSec of latency, for a total of 100 nSec per cable.
Figure 4: TwinAx Cable
SFP+
SFP+ provides the lowest latency solution today. With a variety of SFP+ transceivers available for multimode and single-mode fiber, there are plenty of options for 10GE cabling without the added latency of 10GbaseT or TwinAx. With its smaller form factor, lower cost, and lower power consumption compared to previous X2 and XENPAK transceivers, SFP+ allows for much higher port densities than previously possible. The Nexus 5010 currently supports up to 26 Line Rate 10GE ports in a compact 1 RU form factor, with the 5020 providing 52 Line Rate 10GE ports in 2 RUs. SFP+ transceivers look, smell, and feel like SFP transceivers but operate at 10 Gbps. A limited number of SFP+ ports will also accept GE SFP transceivers for backwards compatibility.
What NICs Should I Use?
iWARP
iWARP utilizes RDMA over Ethernet instead of Infiniband. This provides the same Kernel Bypass/Zero Copy functionality without the need for a separate infrastructure. However, with iWARP, just as with IB, applications must be written to the libibverbs library to take advantage of this functionality.
Key Players: NetEffect (Intel), Chelsio, Mellanox, ServerEngines
Supported Operating Systems: Linux
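To give a flavor of what "written to the libibverbs library" means in practice, the sketch below opens an RDMA-capable device and registers a buffer so the NIC can place data into it directly. It shows only the setup portion (queue pairs, completion queues, and connection establishment are omitted) and assumes libibverbs plus RDMA-capable hardware are installed.

```c
/* Minimal libibverbs setup sketch: open a device and register memory
 * that the RDMA-capable NIC may read/write without CPU involvement. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA-capable devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);              /* protection domain */

    /* Pin and register 1 MB of application memory; the returned memory
     * region keys are what a peer uses for zero-copy RDMA writes. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("registered buffer, rkey=0x%x\n", mr->rkey);

    /* ... queue pair creation, connection setup, and posting of RDMA
     *     write work requests would follow here ... */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```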
User Space APIs
Numerous NIC vendors are now developing User Space APIs which give you all the benefits of iWARP,
without having to re-write your application. This middleware translates between sockets programming
and the hardware.
Key Players: Myricom and Solarflare
Figure 5: User Space Library Software Block Diagram
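One common way such user-space libraries hook into unmodified applications on Linux is by interposing on the standard sockets calls, for example via LD_PRELOAD. The fragment below is a generic, hypothetical illustration of that interception pattern; it simply logs the call and forwards it to the real libc sendto(), whereas a vendor user-space stack would hand the data to the NIC directly at that point.

```c
/* Hypothetical sockets-interception shim, built as a shared library and
 * activated with LD_PRELOAD. A user-space 10GE stack uses the same hook
 * point but transmits via the NIC instead of forwarding to the kernel. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

typedef ssize_t (*sendto_fn)(int, const void *, size_t, int,
                             const struct sockaddr *, socklen_t);

ssize_t sendto(int fd, const void *buf, size_t len, int flags,
               const struct sockaddr *dest, socklen_t destlen)
{
    /* Look up the real libc sendto() once. */
    static sendto_fn real_sendto;
    if (!real_sendto)
        real_sendto = (sendto_fn)dlsym(RTLD_NEXT, "sendto");

    /* This sketch just observes and forwards; a real user-space stack
     * would bypass the kernel here. */
    fprintf(stderr, "intercepted sendto() of %zu bytes\n", len);
    return real_sendto(fd, buf, len, flags, dest, destlen);
}
```

Built as a shared object and loaded with LD_PRELOAD, this shim sits under an unmodified sockets application, which is the same transparency the vendor middleware aims for.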
MX (Myrinet Express)
Myricom has its roots in High Performance Computing. It originally developed an HPC protocol called Myrinet, but has since shifted its development toward 10GE.
Key Players: Myricom
Supported Operating Systems: Linux and Windows
TCP/UDP Acceleration
OpenOnload
Key Players: Solarflare
Supported Operating Systems:
TCP/UDP Acceleration
Figure 6: OpenOnload Software Block Diagram
SR-IOV (Single Root Input/Output Virtualization)
SR-IOV was originally designed for Virtual Machine environments. SR-IOV allows a single 10GE NIC to be divided into multiple Virtual NICs (vNICs), which are then mapped to Virtual Machines. The same concept can be applied in a non-virtualized environment, mapping each vNIC to Application Memory and once again providing Kernel Bypass/Zero Copy functionality (a configuration sketch follows Figure 7 below).
Key Players: ServerEngines (Chelsio, NetEffect, Mellanox, Broadcom, and Neterion in the future)
Supported Operating Systems:
TCP/UDP Acceleration
Figure 7: SR-IOV in a Virtualized Server Environment
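As a hedged illustration of carving a physical NIC into vNICs on a Linux host: recent Linux kernels expose an sriov_numvfs attribute in sysfs for SR-IOV-capable adapters, and writing a count to it creates that many virtual functions. The interface name eth0 and the VF count below are placeholder assumptions; the exact provisioning mechanism varies by vendor and driver.

```c
/* Enable SR-IOV virtual functions on a capable NIC via Linux sysfs.
 * Requires root and an SR-IOV-capable adapter/driver; "eth0" and the
 * VF count of 4 are placeholders for this sketch. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/class/net/eth0/device/sriov_numvfs";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    fprintf(f, "4\n");   /* request four virtual functions (vNICs) */
    fclose(f);
    puts("requested 4 VFs; each appears to the OS as an additional NIC");
    return 0;
}
```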
How Does This Affect My Applications?
Cisco has teamed with NetEffect (Intel) to provide a solution that delivers the theoretical advantages of Infiniband without the drawbacks. Cisco and NetEffect combined forces to write a middleware called RAB, or RDMA Accelerated Buffers, which is optimized for use with Wombat Data Fabric. Cisco is also exploring another middleware called DAL, or Datagram Acceleration Layer, which could be used with TIBCO RV or any other application using UDP Multicast. This middleware allows for decreased CPU Utilization and reduced In-Host Latency with no modifications to your application.
Figure 8: RAB and DAL Middleware Software Block Diagram
Conventional Server I/O requires packets to be processed by the server’s CPU using the Operating
System’s UDP/IP Stack. This involves multiple interrupts, context switches, and copies of the data. This
ultimately leads to high CPU Utilization and unnecessary In-Host Latency.
Figure 9: Conventional UDP/IP Communication
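For context, the application code this targets is ordinary UDP multicast sockets code such as the subscription below (the group address and port are illustrative assumptions). The point of the middleware approach is that code like this does not change; only the library servicing the calls does.

```c
/* Ordinary UDP multicast subscription, as used by market-data feeds.
 * Under interception middleware the same calls are serviced in user
 * space by the NIC library instead of the kernel UDP/IP stack. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(14000);                 /* illustrative feed port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* Join an illustrative multicast group carrying the data feed. */
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    char buf[2048];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);    /* standard receive path */
    printf("received %zd-byte multicast datagram\n", n);
    return 0;
}
```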
DAL Middleware intercepts conventional sockets calls and writes them directly to the NetEffect NIC,
providing the Kernel Bypass/Zero Copy functionality without the headaches of re-writing your
application, as was required with Infiniband.
Figure 10: Kernel Bypass, Zero Copy Communication with DAL
How Does This Impact Latency?
As mentioned earlier, with today’s low-latency networks, roughly 90% of End-to-End latency actually resides within the server itself rather than in the network. We performed a baseline test and found ping-pong latency to be on the order of 35-40 uSec, with roughly 30 uSec residing within the server and only 7 uSec coming from the core switching infrastructure.
Figure 11: Sources of End to End Latency
We are seeing Market Data and High Performance Computing environments move to 10GE not only for added throughput, but also for reduced latency. Moving from GE to 10GE reduces Serialization Delay by an order of magnitude; for Jumbo Frames, serialization delay drops from 72 uSec to 7.2 uSec, as seen below.
Table 1: Serialization Delay Comparison
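Serialization delay is simply the frame size in bits divided by the link rate, so the figures above can be reproduced directly. The sketch below computes it for a 9000-byte jumbo frame (the exact jumbo MTU varies by platform) at GE and 10GE rates.

```c
/* Serialization delay = frame size (bits) / link rate (bits per second). */
#include <stdio.h>

int main(void)
{
    const double frame_bits = 9000.0 * 8;               /* ~jumbo frame   */
    const double rates_bps[] = { 1e9, 10e9 };            /* GE and 10GE    */

    for (int i = 0; i < 2; i++) {
        double usec = frame_bits / rates_bps[i] * 1e6;
        printf("%.0f Gbps: %.1f uSec\n", rates_bps[i] / 1e9, usec);
    }
    return 0;                                             /* prints 72.0 and 7.2 */
}
```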
Furthermore, by utilizing the User Space APIs and the Kernel Bypass/Zero Copy functionality they
provide, we have seen Application Layer to Application Layer Latency reduced to less than 6 uSec.
Table 2: Latency Comparison
Overall, the end result of moving from GE to 10GE is an End-to-End latency decrease of over 80%.
Figure 12: End to End Latency Comparison
