© Informatix Solutions, 2010. Version 1.2
Informatix Solutions
ETHERNET V. INFINIBAND
Reliable Transport
• InfiniBand uses hardware-based retransmission
• InfiniBand uses both link-level and end-to-end CRCs
• Ethernet is a best-effort delivery service that is allowed to drop
packets; it relies on TCP/IP for reliability, with retransmission
typically implemented in software
• Implementing TCP/IP in hardware has proven much more challenging
than expected. TCP offload (TOE) cards have not been very
successful and have not been shown to lower latency
• TCP/IP is the major performance bottleneck for bandwidths of
10G and above
• InfiniBand delivers reliability at the hardware level, providing
higher throughput, lower latency and minimal jitter. This
enables use without TCP/IP
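The two-level CRC scheme can be sketched as follows. This is a toy model, not the actual InfiniBand format: real IB uses a 16-bit VCRC and a 32-bit ICRC with specific polynomials, whereas here `zlib.crc32` stands in for both. The point it illustrates is why two CRCs are needed: the per-link CRC catches wire errors on each hop, while the end-to-end CRC catches corruption introduced inside a switch.

```python
import zlib

def send_with_crcs(payload: bytes) -> bytes:
    # Append an end-to-end CRC at the source (analogous to the ICRC),
    # then a per-link CRC for the next hop (analogous to the VCRC).
    icrc = zlib.crc32(payload).to_bytes(4, "big")
    frame = payload + icrc
    vcrc = zlib.crc32(frame).to_bytes(4, "big")
    return frame + vcrc

def check_link(frame: bytes) -> bytes:
    # Each hop verifies and strips the link CRC; on failure the hardware
    # drops the packet and retransmission kicks in.
    body, vcrc = frame[:-4], frame[-4:]
    assert zlib.crc32(body).to_bytes(4, "big") == vcrc, "link CRC error"
    return body

def check_end_to_end(frame: bytes) -> bytes:
    # The receiver verifies the end-to-end CRC, catching corruption that
    # occurred inside a switch, which a per-link CRC alone would miss.
    payload, icrc = frame[:-4], frame[-4:]
    assert zlib.crc32(payload).to_bytes(4, "big") == icrc, "end-to-end CRC error"
    return payload

frame = send_with_crcs(b"market data")
payload = check_end_to_end(check_link(frame))
```

In hardware both checks run at line rate, which is why this reliability costs essentially nothing in latency.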
Flow Control
• InfiniBand uses credit-based flow control on each link, which
means that InfiniBand switch chips can be built with much
smaller on-chip buffers than Ethernet
• Ethernet switches rely on explicit packet drops for flow
control, which requires larger buffers because the cost of
retransmission is very high.
• This technical difference enables the building of larger, lower-cost
switch chips for InfiniBand than for Ethernet
• This has resulted in larger, higher-density, lower-contention
InfiniBand switches with a lower cost per port than their 10GE
equivalents
• maximum 40G InfiniBand, zero contention port density is 648
ports
• maximum 10G Ethernet, zero contention port density is 384
ports
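The credit mechanism behind these smaller buffers can be sketched in a few lines. This is an illustrative toy model (the class and names are ours, not from any IB stack): the sender may only transmit while it holds receive-buffer credits advertised by the receiver, so the receiver's buffer can never overflow and no packet is ever dropped; running out of credits stalls the sender instead.

```python
class CreditLink:
    """Toy model of per-link, credit-based flow control."""
    def __init__(self, credits: int):
        self.credits = credits      # receive buffers advertised to the sender
        self.rx_queue = []

    def try_send(self, pkt) -> bool:
        if self.credits == 0:
            return False            # stall (backpressure) instead of dropping
        self.credits -= 1
        self.rx_queue.append(pkt)
        return True

    def consume(self):
        # Receiver drains one buffer and returns the credit to the sender.
        pkt = self.rx_queue.pop(0)
        self.credits += 1
        return pkt

link = CreditLink(credits=2)
assert link.try_send("p1") and link.try_send("p2")
assert not link.try_send("p3")      # out of credits: sender waits
link.consume()                      # credit returned...
assert link.try_send("p3")          # ...transmission resumes, nothing lost
```

Because overflow is impossible by construction, the buffer only needs to cover the credit round-trip of one link, not worst-case congestion, which is what allows the smaller on-chip buffers.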
Switch performance
• InfiniBand has late packet invalidation, which enables cut-through
switching for low-latency, non-blocking performance
spanning the entire fabric
• Virtually all Ethernet switches are designed for L3/L4
switching, which requires packet rewrite and therefore a
store-and-forward architecture
• Where Ethernet supports cut-through, reliable deployment is
limited to local clusters and small subnets due to the need to
prevent propagation of invalid packets.
The latency impact of store-and-forward is significant: for a
500-byte packet at 1 Gbps it is 5.7 µs; for a 1500-byte packet
at 1 Gbps it is 16.3 µs.
Store-and-forward adds this overhead at every hop!
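The per-hop cost is just the serialization delay of the frame, which is easy to compute. The sketch below gives the raw wire times, which are lower bounds: the slide's 5.7 µs and 16.3 µs figures presumably include framing and switch-processing overhead on top of the 4 µs and 12 µs of pure serialization.

```python
def serialization_delay_us(frame_bytes: int, rate_bps: float) -> float:
    # Time to clock one frame onto the wire. A store-and-forward switch
    # must receive the whole frame before forwarding, so it pays this in
    # full at every hop; a cut-through switch forwards as soon as the
    # header has arrived.
    return frame_bytes * 8 / rate_bps * 1e6

def store_and_forward_total_us(frame_bytes: int, rate_bps: float,
                               hops: int) -> float:
    # The penalty accumulates linearly with hop count.
    return hops * serialization_delay_us(frame_bytes, rate_bps)

d500 = serialization_delay_us(500, 1e9)                 # ~4 µs
d1500 = serialization_delay_us(1500, 1e9)               # ~12 µs
total = store_and_forward_total_us(1500, 1e9, hops=3)   # ~36 µs over 3 hops
```

The three-hop example shows why multi-stage store-and-forward fabrics fall so far behind a cut-through fabric, which pays the serialization time only once.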
Congestion Management
• InfiniBand has end-to-end congestion management as part of
the existing standard (IBA 1.2)
• Congestion detection at receiver sends notification messages to
sender to reduce rate
• Policies determine recovery rates
• Ethernet relies on TCP
• Issues with current TCP congestion-management algorithms
on high-bandwidth, long-haul circuits limit single-session
throughput, because the window size is halved on each
congestion event
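The scale of the window-halving problem is easy to quantify. The sketch below uses illustrative figures of our own choosing (a 10 Gbps path with 100 ms RTT and 1500-byte segments, Reno-style additive increase of one segment per RTT), not numbers from the slide:

```python
# Bandwidth-delay product of a long-haul path, in segments in flight:
RATE_BPS, RTT_S, SEG_BYTES = 10e9, 0.100, 1500
bdp_segments = int(RATE_BPS * RTT_S / (SEG_BYTES * 8))

def rtts_to_recover(cwnd_segments: int) -> int:
    # Reno-style TCP halves the congestion window on a loss event, then
    # grows it back additively, one segment per RTT: recovery therefore
    # takes cwnd/2 round trips.
    return cwnd_segments // 2

recovery_minutes = rtts_to_recover(bdp_segments) * RTT_S / 60
# A single congestion event costs on the order of an hour of reduced
# throughput on this path, which is why window halving cripples
# single-session long-haul transfers.
```

Under these assumptions a single halving takes tens of thousands of round trips to recover from, whereas InfiniBand's rate-based congestion notifications adjust the injection rate directly without collapsing a window.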
TCP/IP recovery for congestion events
Courtesy of Obsidian Research
Management
• InfiniBand includes a complete management stack which
provides high levels of configurability and observability.
• Ethernet relies on TCP/IP to correct errors
• InfiniBand enables an end-to-end solution to be deployed and
run reliably without the overhead of TCP/IP
• Ethernet relies on add-ons such as trunking and spanning tree
to add resiliency into Layer 2 networks.
• Ethernet Spanning Tree is active:standby and takes seconds to
switch over. Switchover causes multicast flooding in most
switches.
• InfiniBand preselects failover paths and switches almost
instantly
The challenges of reliable networks
• Reliable delivery has a downside. The packets have to be
saved somewhere before they can be delivered
• InfiniBand transfers this buffering from the TCP socket on the
server to the network
• All credit-based networks suffer from congestion
• When receivers are not fast enough, packets build up in the
network. This backpressure slows the sender but causes
congestion on shared links potentially impacting other nodes
or applications
• The Ethernet Pause feature has similar impact and is how FCoE
achieves reliable delivery
• InfiniBand uses independent Tx/Rx buffers for each Virtual
Lane to mitigate this impact
• Requires careful QoS design to minimise the onset of
congestion and to utilize the VLs
The challenges of mesh networks
• Mesh networks present multiple active paths to link two
nodes
• Spanning tree solves this by pruning the mesh and reducing it
to one active path
• Applications, particularly those using RDMA, require in-order
delivery. This can only be achieved with a fixed path
between the two nodes
• This constrains bandwidth usage and requires more
sophisticated path selection algorithms with load balancing
• Reconfiguration events can result in looping packets
consuming all bandwidth
• Topology design and path selection have to address potential
loops
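One common way to reconcile in-order delivery with multiple active paths is deterministic per-flow path selection. The sketch below is a generic illustration of the idea (the path names and hashing scheme are ours, not from any subnet manager): every packet of a given source/destination pair is pinned to one path, so ordering is preserved, while different flows still spread across all links instead of idling them as Spanning Tree does.

```python
import hashlib

# Multiple active paths through the mesh (hypothetical names):
PATHS = ["via-sw1", "via-sw2", "via-sw3", "via-sw4"]

def path_for_flow(src: str, dst: str) -> str:
    # Hash the (src, dst) pair so every packet of a flow takes the SAME
    # path (in-order delivery), while distinct flows are load-balanced
    # across all available paths.
    digest = hashlib.sha256(f"{src}->{dst}".encode()).digest()
    return PATHS[digest[0] % len(PATHS)]

# Packets of one flow always agree on the path:
assert path_for_flow("nodeA", "nodeB") == path_for_flow("nodeA", "nodeB")
# ...while a population of flows uses more than one link:
chosen = {path_for_flow(f"node{i}", "nodeB") for i in range(32)}
assert len(chosen) > 1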
RDMA
• RDMA is part of the InfiniBand standard. It is a mandatory
requirement and has been extensively interoperability-tested,
which allows multivendor configurations to be safely deployed
• RDMA on Ethernet mostly requires a matched pair of NICs
• iWARP added the (latency) cost of TCP to overcome Ethernet's
reliability problems
• RDMA over Ethernet is an emerging standard and relies on FCoE-style
reliable delivery to avoid the need for a TCP layer. This
requires a Converged Enhanced Ethernet NIC to be deployed
end-to-end for RDMA
• InfiniBand RDMA, as written by OFED, is standard in Linux 2.6.14
and later. Torvalds will not permit another RDMA
implementation in the stock kernel. Ethernet manufacturers
are slowly adding OFED support to their cards
Cabling
• The promise of 10GBASE-T has long been held out as the route to
lower Ethernet prices
• The technical challenges of running 10G over RJ45 have been
immense
• Requires 6W to drive
• Needs Cat-6A or Cat-7 cable, so will rarely run over existing
infrastructure
• Uses a block-level error-detection scheme that requires the full
block to be loaded, adding 2 µs to every hop
• Few vendors support 10GBASE-T for these reasons
• SFP+ is the most popular 10G option
• Small form factor gives the same packing density as RJ45
• 1W and 100 ns latency
• Available in both copper and optical (LC format)
• Comparable to the QSFP used by InfiniBand (and 40G Ethernet)
Long Distance
• InfiniBand is commonly viewed as fit for local clusters only.
This is incorrect; the perception stems from the fat, short
copper cables
• InfiniBand and Ethernet share the same cabling at 40G and
above. At 10G the cables (SFP+ v. QSFP) are similar and have
similar physical constraints
• InfiniBand Fibre cables are available up to 4km
• The higher scalability of InfiniBand subnets (48K ports) means
that remote sites can be safely bridged without incurring the
penalties of routing delays
• Long distance InfiniBand switches provide the necessary
packet buffering to support distances of thousands of miles -
e.g. US DoD coast-to-coast InfiniBand fabric
Advantages of InfiniBand over Ethernet
• Cut-through design with hardware-generated link and
end-to-end CRCs and late packet invalidation
• Avoids the packet buffering required by Ethernet
• 5 µs latency compared with 20 µs for Ethernet
• Implicit layer-2 trunking: bundles of 1, 4 or 12 physical links into a
single “logical” channel, handled transparently by the
hardware
• Ethernet trunking is a vendor option, implemented in the NIC driver rather than hardware
• Ethernet is confused by competing standards, e.g. Cisco EtherChannel
• Ethernet does not stripe an individual packet, whilst InfiniBand does
• Standardized RDMA to lower CPU overhead
• Ethernet currently has vendor-specific RDMA, requiring matched pairs of cards and device-driver
support. Efforts to standardize are ongoing in the Ethernet community
• The legacy Ethernet protocol constrains large switch
implementation – maximum possible today:
• InfiniBand: 648 ports at 40Gbps with zero contention, or 3052 ports at 20Gbps, no contention
• Cisco Nexus 7000: 32-port 10G blade, 8 per chassis (256 ports), but limited
to an 80G fabric, i.e. 8:1 contention. Still not shipping.
Ethernet Packet Drops and Discards
[Diagram: packet path from Server 1 to Server 2 – user program → kernel
(context switch) → socket buffers → TCP/IP → NIC buffers → TOR switch
buffers (1:2.4 contention) → core router buffers (1:10 contention) → the
same stack in reverse on Server 2. Packets are queued using flow control
at each stage and discarded when buffers are full; error detection and
correction is left to TCP.]
Impact on latency and jitter
Whilst the latency savings of InfiniBand over 10GE are only
small, InfiniBand has a big advantage in reduced jitter
compared with Ethernet
Diagram courtesy of STAC research
Multicast Latency RTT in microseconds
[Chart: percentage distribution of multicast RTT (nanoseconds on the
y-axis, 0–500,000; percentile 5–99 on the x-axis) for 1GE (kernel
2.6.30.1), 10GE (RH5.3, Nexus), IB 2.6.27.10, IB 2.6.30.1 and IB Verbs
2.6.30.1. Setup: MC loopback onto a new group, with one MC sender and
one MC receiver; the 10GE switch was a Cisco Nexus 5010.]

Interface        BW    Min (µs)  Avg (µs)  Median (µs)  Max (µs)  STD (µs)
Ethernet         1G    41.64     92.14     92.38        151.69    23.65
Ethernet         10G   45.00     65.48     48.00        3485.0    172.2
IPoIB (bonded)   80G   32.11     51.28     52.64        199.20    4.30
IPoIB            40G   23.22     24.86     23.99        1582      27.29
IB Verbs         40G   5.79      6.21      6.14         46.90     0.51
Testing carried out at Intel fasterLAB
InfiniBand v. Ethernet Latency Distributions
[Two histograms of RTT samples (number of samples v. RTT, normalised
time axis): (1) 10GE via Nexus, kernel 2.6.18 – RTTs spread from 45.00
to 434.00 µs; (2) InfiniBand, kernel 2.6.30 – RTTs tightly clustered
between 23.2 and 26.4 µs.]
• InfiniBand may have derived some benefit from the newer kernel
• The Nexus suffered jitter on 5% of packets, which caused a long tail
(median 48.00 µs RTT v. 23.99 µs for InfiniBand)
40/100G Ethernet v. 40G InfiniBand
• The IEEE 802.3ba standard is not expected to be agreed before
mid-2010
• Currently only carrier-class switches are available; DC-class
switches are needed before it is deployable to the server
• Switch port cost is around $40K for 40G Ethernet
• An Ethernet 10G Cisco Nexus port is currently around $930
• A 10G Ethernet dual-port NIC is ~$1,000
• A 40G longline is currently 6x the cost of a 10G longline
• A 100G longline is currently 20x the cost of a 10G longline
• The InfiniBand 40G standard is agreed, and can be configured for 120G by
using 12x links
• 40G InfiniBand products for both switches and servers have been shipping
in volume since 2009
• InfiniBand 40G switch port already < $300 (36-port)
• InfiniBand 40G HCA dual port ~$850
SWOT Analysis from a client Design Study

Cisco Catalyst 6500 – budgetary cost €1,437K
  Strengths: widest installed base; proven technology; risk-averse choice
  Weaknesses: poor bandwidth usage; complexity of configuration; costly
  given the provided functionality; same as everybody else – no latency
  advantage

Nortel ERS8600 – budgetary cost €919K
  Strengths: well proven; high B/W usage through active:active L2 links;
  simpler L2 management than Cisco; better POP scalability through
  multipoint support (SW upgrade in 2009); risk-neutral; lowest-cost
  solution
  Weaknesses: different to Cisco, so a small learning curve

InfiniBand – budgetary cost €1,330K
  Strengths: lowest-latency solution; high B/W usage through
  active:active links; first to deploy a pan-market low-latency solution
  in Europe; includes 20Gb/s server attach, providing additional
  application performance benefits
  Weaknesses: learning curve of a new technology; first installation in
  Financial Services in Europe of this distributed IB fabric; could be
  considered a bleeding-edge solution and therefore highest risk
Notes: five sites using existing long-haul circuits. Costs covered the
purchase of all network equipment, plus HCAs for the InfiniBand option,
using vendor list prices. The Ethernet option only provided 1Gb/s server
attach. By the time we came to deploy, the InfiniBand products had been
upgraded to 40G within these budgetary estimates.
Comparison with Ethernet - summary

Ethernet:
• Best-effort delivery: any device may drop packets
• Relies on TCP/IP to correct any errors
• Subject to microbursts
• Store and forward (cut-through usually limited to a local cluster)
• Standardization around compatible RDMA NICs only now starting – the
same NICs are needed at both ends
• Trunking is an add-on, with multiple standards and extensions
• Spanning Tree creates idle links
• Now adding congestion management for FCoE, but the standards are
still developing
• Carries legacy from its origins as a CSMA/CD medium
• Ethernet switches are not as scalable as InfiniBand's
• Provisioned port cost for 10GE is approx. 40% higher than the cost
of 40G InfiniBand

InfiniBand:
• Guaranteed delivery with credit-based flow control
• Hardware-based retransmission
• Dropped packets prevented by congestion management
• Cut-through design with late packet invalidation
• RDMA baked into the standard and proven by interoperability testing
• Trunking is built into the architecture
• All links are used
• Must use QoS when sharing between different applications
• Supports storage today
• Green-field design that applied lessons learnt from
previous-generation interconnects
• Legacy protocol support with IPoIB, SRP, vNICs and vHBAs
Related info
• Hedge by deploying the Mellanox VPI range of HCAs. These
dual-port (CX4) cards can be configured to run InfiniBand or
10G Ethernet. They support OFED on both, as well as RDMAoE. The
HCA has drivers for Linux and Windows.
• See also:
• Serialization costs
• Multicast
• Ethernet to InfiniBand gateways
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 

Recently uploaded (20)

National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 

Ethernet v. InfiniBand

  • 1. © Informatix Solutions, 2010 (Version 1.2)
       ETHERNET V. INFINIBAND
  • 2. Reliable Transport
    • InfiniBand uses hardware-based retransmission
    • InfiniBand uses both link-level and end-to-end CRCs
    • Ethernet is a best-effort delivery service: it is allowed to drop packets and relies on the TCP/IP protocol, typically implemented in software, for retransmission and reliability
    • Implementing TCP/IP in hardware has proven much more challenging than anticipated. TCP offload cards have not been very successful and have not been shown to lower latency
    • TCP/IP is the major performance bottleneck at bandwidths of 10G and above
    • InfiniBand delivers reliability at the hardware level, providing higher throughput and lower latency while rarely causing jitter. This enables use without TCP/IP
  • 3. Flow Control
    • InfiniBand uses credit-based flow control on each link, which means InfiniBand switch chips can be built with much smaller on-chip buffers than Ethernet
    • Ethernet switches rely on explicit packet drops for flow control, which requires larger buffers because the cost of retransmission is very high
    • This technical difference enables building larger, lower-cost switch chips for InfiniBand than for Ethernet
    • The result is larger, higher-density, lower-contention InfiniBand switches with a lower cost per port than their 10GbE equivalents
      • Maximum 40G InfiniBand zero-contention port density: 648 ports
      • Maximum 10G Ethernet zero-contention port density: 384 ports
  • 4. Switch Performance
    • InfiniBand has late packet invalidation, which enables cut-through switching for low-latency, non-blocking performance spanning the entire fabric
    • Virtually all Ethernet switches are designed for L3/L4 switching, which requires packet rewrite and therefore a store-and-forward architecture
    • Where Ethernet supports cut-through, reliable deployment is limited to local clusters and small subnets due to the need to prevent propagation of invalid packets
    • The latency impact of store-and-forward is significant: for a 500-byte packet at 1 Gbps it is 5.7 µs; for a 1500-byte packet at 1 Gbps it is 16.3 µs. Store-and-forward adds this overhead at every hop!
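The per-hop cost of store-and-forward can be sketched as the time to clock a whole packet onto the wire. This is a minimal illustration of raw serialization delay only; the slide's per-hop figures are somewhat higher, presumably because they include switch processing overhead as well.

```python
# Raw wire-serialization delay sketch (serialization only, no switch
# processing). A store-and-forward switch must receive the entire
# packet before forwarding, so this cost is paid again at every hop;
# a cut-through switch starts forwarding once the header has arrived.
def serialization_delay_us(packet_bytes: int, link_bps: float) -> float:
    """Time to clock one packet onto the wire, in microseconds."""
    return packet_bytes * 8 / link_bps * 1e6

print(round(serialization_delay_us(500, 1e9), 3))   # 4.0
print(round(serialization_delay_us(1500, 1e9), 3))  # 12.0
```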
  • 5. Congestion Management
    • InfiniBand has end-to-end congestion management as part of the existing standard (IBA 1.2)
    • Congestion detection at the receiver sends notification messages to the sender to reduce its rate
    • Policies determine recovery rates
    • Ethernet relies on TCP
    • Issues with current TCP congestion-management algorithms on high-bandwidth, long-haul circuits limit single-session throughput, because the window size halves on each congestion event
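Why window halving hurts so badly on long-haul circuits can be sketched with a toy AIMD model, assuming the classic behaviour of one-segment-per-RTT additive increase after a halving; the 10 Gbps / 100 ms / 1500-byte figures are illustrative assumptions, not from the slide.

```python
# Toy AIMD model (assumed parameters): TCP halves its congestion
# window on a congestion event, then grows it by one segment per RTT,
# so recovery time scales with the bandwidth-delay product.
def rtts_to_recover(cwnd_segments: int) -> int:
    """RTTs needed to climb back to cwnd after a single halving."""
    return cwnd_segments - cwnd_segments // 2

# Window needed to fill a 10 Gbps path with 100 ms RTT (1500-byte segments):
bdp_segments = int(10e9 * 0.100 / (1500 * 8))
print(bdp_segments)                   # 83333
print(rtts_to_recover(bdp_segments))  # 41667 RTTs, i.e. over an hour at 100 ms
```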
  • 6. TCP/IP Recovery for Congestion Events
    (Diagram courtesy of Obsidian Research)
  • 7. Management
    • InfiniBand includes a complete management stack, providing high levels of configurability and observability
    • Ethernet relies on TCP/IP to correct errors
    • InfiniBand enables an end-to-end solution to be deployed and run reliably without the overhead of TCP/IP
    • Ethernet relies on add-ons such as trunking and Spanning Tree to add resiliency to Layer 2 networks
    • Ethernet Spanning Tree is active:standby and takes seconds to switch over; the switchover causes multicast flooding in most switches
    • InfiniBand preselects failover paths and switches over almost instantly
  • 8. The Challenges of Reliable Networks
    • Reliable delivery has a downside: packets have to be stored somewhere before they can be delivered
    • InfiniBand transfers this buffering from the TCP socket on the server into the network
    • All credit-based networks suffer from congestion
    • When receivers are not fast enough, packets build up in the network. This backpressure slows the sender but causes congestion on shared links, potentially impacting other nodes or applications
    • The Ethernet Pause feature has a similar impact and is how FCoE achieves reliable delivery
    • InfiniBand uses independent Tx/Rx buffers for each Virtual Lane to mitigate this impact
    • Careful QoS design is required to minimise the onset of congestion and make use of the VLs
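The backpressure behaviour described above can be sketched with a minimal simulation, under assumed parameters (credit count, drain rates are hypothetical): with credit-based flow control a slow receiver stalls the sender rather than forcing packet drops.

```python
from collections import deque

# Minimal sketch (hypothetical parameters) of per-link credit-based
# flow control: the sender transmits only while it holds credits, and
# the receiver returns one credit each time it drains a packet. A slow
# receiver therefore back-pressures the sender; nothing is dropped.
def run(credits: int, to_send: int, drain_every: int) -> int:
    buf = deque()
    sent = stalls = tick = 0
    while sent < to_send or buf:
        tick += 1
        if sent < to_send:
            if credits > 0:
                credits -= 1          # consume a credit to transmit
                buf.append(sent)
                sent += 1
            else:
                stalls += 1           # no credit: stall, do not drop
        if tick % drain_every == 0 and buf:
            buf.popleft()             # receiver consumes a packet...
            credits += 1              # ...and returns the credit
    return stalls

# A receiver draining 3x slower than the sender forces stalls, not drops:
print(run(credits=4, to_send=20, drain_every=3) > 0)  # True
print(run(credits=4, to_send=20, drain_every=1))      # 0 (fast receiver: no stalls)
```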
  • 9. The Challenges of Mesh Networks
    • Mesh networks present multiple active paths between any two nodes
    • Spanning Tree solves this by pruning the mesh, reducing it to one active path
    • Applications, particularly those using RDMA, require in-order delivery. This can only be achieved with a fixed path between the two nodes
    • This constrains bandwidth usage and requires more sophisticated path-selection algorithms with load balancing
    • Reconfiguration events can result in looping packets consuming all bandwidth
    • Topology design and path selection have to address potential loops
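One common way to reconcile in-order delivery with load balancing across a mesh is to pin each flow to a single path deterministically. This is a hypothetical sketch (the hash-over-endpoints scheme is illustrative, not the slide's method): every packet of a flow follows the same path, while distinct flows spread across the available paths.

```python
import hashlib

# Hypothetical deterministic path selection: hash a flow's endpoints
# onto one of several equal-cost paths. Packets of one flow always
# take the same path (preserving order); different flows spread load.
def pick_path(src: str, dst: str, n_paths: int) -> int:
    digest = hashlib.sha256(f"{src}->{dst}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Same flow, same path, every time:
paths = {pick_path("nodeA", "nodeB", 4) for _ in range(1000)}
print(len(paths))  # 1
```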
  • 10. RDMA
    • RDMA is part of the InfiniBand standard. It is a mandatory requirement and has been extensively interoperability-tested, which allows multi-vendor configurations to be deployed safely
    • RDMA on Ethernet mostly requires a matched pair of NICs
    • iWARP added the (latency) cost of TCP in order to overcome Ethernet's reliability problems
    • RDMA over Ethernet is an emerging standard that relies on FCoE reliable delivery to avoid the need for a TCP layer. This requires a Converged Enhanced Ethernet NIC to be deployed end-to-end for RDMA
    • InfiniBand RDMA, via the OFED stack, is standard in Linux 2.6.14 and later. Torvalds will not permit another RDMA implementation in the stock kernel. Ethernet manufacturers are slowly adding OFED support to their cards
  • 11. Cabling
    • The promise of 10GBASE-T has always been held out as a way to lower Ethernet prices
    • The technical challenges of running 10G over RJ45 have been immense:
      • Requires 6 W to drive
      • Needs Cat-6A or Cat-7 cable, so will rarely run over existing infrastructure
      • Uses a block-level error-detection scheme that requires the full block to be received, adding 2 µs to every hop
    • Few vendors support 10GBASE-T for these reasons
    • SFP+ is the most popular 10G option:
      • Small form factor gives the same packing density as RJ45
      • 1 W and 100 ns latency
      • Available in both copper and optical (LC format)
      • Comparable to the QSFP used by InfiniBand (and 40G Ethernet)
  • 12. Long Distance
    • InfiniBand is commonly viewed as fit for local clusters only. This is incorrect, a perception caused by the fat, short copper cables
    • InfiniBand and Ethernet share the same cabling at 40G and above. At 10G the cables are similar (SFP+ v. QSFP) and have similar physical constraints
    • InfiniBand fibre cables are available up to 4 km
    • The higher scalability of InfiniBand subnets (48K ports) means that remote sites can be safely bridged without incurring the penalties of routing delays
    • Long-distance InfiniBand switches provide the necessary packet buffering to support distances of thousands of miles, e.g. the US DoD coast-to-coast InfiniBand fabric
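The reason long-distance InfiniBand switches need deep packet buffers follows from the flow-control model: to keep a long link full, credits (and hence receive buffering) must cover at least the bandwidth-delay product. A rough sketch with assumed figures (≈5 µs/km propagation in fibre; the 1000 km example is illustrative):

```python
# Bandwidth-delay product sketch (assumed ~5 µs/km fibre propagation):
# credit-based flow control needs at least this much receive buffering
# to keep a long link busy, hence the deep buffers in long-distance
# InfiniBand switches.
def bdp_bytes(link_bps: float, distance_km: float, us_per_km: float = 5.0) -> float:
    one_way_s = distance_km * us_per_km * 1e-6  # one-way propagation delay
    return link_bps * one_way_s / 8

print(round(bdp_bytes(40e9, 1000) / 1e6, 1))  # 25.0 MB for a 1000 km 40G link
```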
  • 13. Advantages of InfiniBand over Ethernet
    • Cut-through design with hardware-generated link and end-to-end CRCs, plus late packet invalidation
      • Avoids the packet buffering required by Ethernet
      • 5 µs latency compared with 20 µs for Ethernet
    • Implicit Layer 2 trunking: bundles of 1, 4 or 12 physical links form a single "logical" channel, handled transparently by the hardware
      • Ethernet trunking is a vendor option, implemented in the NIC driver rather than in hardware
      • Ethernet is confused by competing standards, e.g. Cisco EtherChannel
      • Ethernet does not stripe an individual packet, whilst InfiniBand does
    • Standardized RDMA to lower CPU overhead
      • Ethernet currently has vendor-specific RDMA, requiring matched pairs of cards and device-driver support. Efforts to standardize are ongoing in the Ethernet community
    • The legacy Ethernet protocol constrains large switch implementations. Maximum possible today:
      • InfiniBand: 648 ports at 40 Gbps with zero contention; 3,052 ports at 20 Gbps with no contention
      • Cisco Nexus 7000: 32-port 10G blade, 8 per chassis (256 ports), but limited to an 80G fabric, i.e. 8:1 contention. Still not shipping.
  • 14. Ethernet Packet Drops and Discards
    (Diagram: the path from an application buffer on Server 1 through the user/kernel context switch, socket buffers, TCP/IP, NIC buffers, top-of-rack switch buffers (1:2.4 contention) and core-router buffers (1:10 contention) to Server 2. Packets are queued using flow control, discarded when buffers are full; error detection and correction is left to TCP.)
  • 15. Impact on Latency and Jitter
    • Whilst the latency savings of InfiniBand over 10GbE are only small, InfiniBand has a big advantage in reduced jitter compared with Ethernet
    (Diagram courtesy of STAC Research)
  • 16. Multicast Latency
    • RTT measured in microseconds: multicast loopback onto a new group, with a multicast sender and receiver. The 10GbE switch was a Cisco Nexus 5010. Testing carried out at the Intel fasterLAB.
    • (Percentile-distribution chart omitted; series measured: 1GbE on kernel 2.6.30.1, 10GbE on RH5.3 via Nexus, IB on 2.6.27.10, IB on 2.6.30.1, IB Verbs on 2.6.30.1)

      | Interface      | BW  | Min (µs) | Avg (µs) | Median (µs) | Max (µs) | STD (µs) |
      |----------------|-----|----------|----------|-------------|----------|----------|
      | Ethernet       | 1G  | 41.64    | 92.14    | 92.38       | 151.69   | 23.65    |
      | Ethernet       | 10G | 45.00    | 65.48    | 48.00       | 3485.0   | 172.2    |
      | IPoIB (bonded) | 80G | 32.11    | 51.28    | 52.64       | 199.20   | 4.30     |
      | IPoIB          | 40G | 23.22    | 24.86    | 23.99       | 1582     | 27.29    |
      | IB Verbs       | 40G | 5.79     | 6.21     | 6.14        | 46.90    | 0.51     |
  • 17. InfiniBand v. Ethernet Latency Distributions
    (Histograms of RTT samples, normalised time axes: 10GbE via Nexus on kernel 2.6.18, 48.00 µs median RTT; InfiniBand on kernel 2.6.30, 23.99 µs median RTT)
    • InfiniBand may have derived some benefit from the newer kernel
    • The Nexus suffered jitter on 5% of packets, which caused a long tail
  • 18. 40/100G Ethernet v. 40G InfiniBand
    • Ethernet:
      • The IEEE 802.3ba standard definition is not expected to be agreed before mid-2010
      • Currently only carrier-class switches are available; data-centre-class switches are needed before deployment to the server
      • Switch port cost is around $40K for 40G Ethernet
      • A 10G Cisco Nexus Ethernet port is currently around $930
      • A 10G Ethernet dual-port NIC is ~$1,000
      • A 40G longline is currently 6x the cost of a 10G longline; a 100G longline is 20x
    • InfiniBand:
      • The InfiniBand 40G standard is agreed, and can be configured for 120G by using 12x links
      • 40G InfiniBand products for both switches and servers have been shipping in volume since 2009
      • An InfiniBand 40G switch port is already < $300 (36-port switch)
      • An InfiniBand 40G dual-port HCA is ~$850
  • 19. SWOT Analysis from a Client Design Study

      | Solution            | Budgetary Cost | Strengths | Weaknesses |
      |---------------------|----------------|-----------|------------|
      | Cisco Catalyst 6500 | €1,437K | Widest installed base; proven technology; risk-averse choice | Poor bandwidth usage; complexity of configuration; costly given the functionality provided; same as everybody else, so no latency advantage |
      | Nortel ERS8600      | €919K   | Well proven; high bandwidth usage through active:active L2 links; simpler L2 management than Cisco; better POP scalability through multipoint support (SW upgrade in 2009); risk-neutral; lowest-cost solution | Different to Cisco, so a small learning curve |
      | InfiniBand          | €1,330K | Lowest-latency solution; high bandwidth usage through active:active links; first to deploy a pan-market low-latency solution in Europe; includes 20 Gb/s server attach, providing additional application-performance benefits | Learning curve of a new technology; first installation in Financial Services in Europe for this distributed IB fabric; could be considered a bleeding-edge solution and therefore highest risk |

    • Five sites using existing long-haul circuits. Costs covered the purchase of all network equipment, plus HCAs for the InfiniBand option, at vendor list prices. The Ethernet options only provided 1 Gb/s server attach. By the time we came to deploy, the InfiniBand products had been upgraded to 40G within these budgetary estimates.
  • 20. Comparison with Ethernet: Summary
    • Ethernet:
      • Best-effort delivery: any device may drop packets; relies on TCP/IP to correct any errors
      • Subject to microbursts
      • Store and forward (cut-through usually limited to a local cluster)
      • Standardization around compatible RDMA NICs is only now starting; the same NICs are needed at both ends
      • Trunking is an add-on, with multiple standards and extensions
      • Spanning Tree creates idle links
      • Now adding congestion management for FCoE, but the standards are still developing
      • Carries legacy from its origins as a CSMA/CD medium
      • Ethernet switches are not as scalable as InfiniBand; the provisioned port cost for 10GbE is approx. 40% higher than that of 40G InfiniBand
    • InfiniBand:
      • Guaranteed delivery, with credit-based flow control and hardware-based retransmission
      • Dropped packets prevented by congestion management
      • Cut-through design with late packet invalidation
      • RDMA baked into the standard and proven by interoperability testing
      • Trunking is built into the architecture; all links are used
      • Must use QoS when sharing between different applications
      • Supports storage today
      • Green-field design that applied lessons learnt from previous-generation interconnects
      • Legacy protocol support with IPoIB, SRP, vNICs and vHBAs
  • 21. Related Info
    • Hedge by deploying the Mellanox VPI range of HCAs. These dual-port (CX4) cards can be configured to run either InfiniBand or 10G Ethernet. They support OFED on both, plus RDMAoE, and the HCA has drivers for Linux and Windows.
    • See also:
      • Serialization costs
      • Multicast
      • Ethernet-to-InfiniBand gateways