Designing Cloud and Grid Computing Systems
with InfiniBand and High-Speed Ethernet
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
A Tutorial at CCGrid ’11
by
Sayantan Sur
The Ohio State University
E-mail: surs@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~surs
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
CCGrid '11
Presentation Overview
2
CCGrid '11
Current and Next Generation Applications and
Computing Systems
3
• Diverse Range of Applications
– Processing and dataset characteristics vary
• Growth of High Performance Computing
– Growth in processor performance
• Chip density doubles every 18 months
– Growth in commodity networking
• Increase in speed/features + reducing cost
• Different Kinds of Systems
– Clusters, Grid, Cloud, Datacenters, …..
CCGrid '11
Cluster Computing Environment
[Figure: a compute cluster (frontend and compute nodes on a LAN) connected over a LAN to a storage cluster (meta-data manager and I/O server nodes holding meta-data and data)]
4
CCGrid '11
Trends for Computing Clusters in the Top 500 List
(http://www.top500.org)
Nov. 1996: 0/500 (0%) Nov. 2001: 43/500 (8.6%) Nov. 2006: 361/500 (72.2%)
Jun. 1997: 1/500 (0.2%) Jun. 2002: 80/500 (16%) Jun. 2007: 373/500 (74.6%)
Nov. 1997: 1/500 (0.2%) Nov. 2002: 93/500 (18.6%) Nov. 2007: 406/500 (81.2%)
Jun. 1998: 1/500 (0.2%) Jun. 2003: 149/500 (29.8%) Jun. 2008: 400/500 (80.0%)
Nov. 1998: 2/500 (0.4%) Nov. 2003: 208/500 (41.6%) Nov. 2008: 410/500 (82.0%)
Jun. 1999: 6/500 (1.2%) Jun. 2004: 291/500 (58.2%) Jun. 2009: 410/500 (82.0%)
Nov. 1999: 7/500 (1.4%) Nov. 2004: 294/500 (58.8%) Nov. 2009: 417/500 (83.4%)
Jun. 2000: 11/500 (2.2%) Jun. 2005: 304/500 (60.8%) Jun. 2010: 424/500 (84.8%)
Nov. 2000: 28/500 (5.6%) Nov. 2005: 360/500 (72.0%) Nov. 2010: 415/500 (83%)
Jun. 2001: 33/500 (6.6%) Jun. 2006: 364/500 (72.8%) Jun. 2011: To be announced
5
CCGrid '11
Grid Computing Environment
6
[Figure: several compute clusters, each with its own frontend, storage cluster, meta-data manager, and I/O server nodes (as in the previous slide), interconnected over a WAN]
CCGrid '11
Multi-Tier Datacenters and Enterprise Computing
7
[Figure: enterprise multi-tier datacenter: Tier 1 routers/servers, Tier 2 application servers, and Tier 3 database servers, connected through switches]
CCGrid '11
Integrated High-End Computing Environments
[Figure: a compute cluster and its storage cluster coupled over LAN/WAN to an enterprise multi-tier datacenter for visualization and mining (Tier 1 routers/servers, Tier 2 application servers, Tier 3 database servers)]
8
CCGrid '11
Cloud Computing Environments
9
[Figure: physical machines hosting VMs, each with local storage, connected over a LAN to a virtual file system consisting of a meta-data server and I/O servers holding the data]
Hadoop Architecture
• Underlying Hadoop Distributed
File System (HDFS)
• Fault-tolerance by replicating
data blocks
• NameNode: stores information
on data blocks
• DataNodes: store blocks and host MapReduce computation
• JobTracker: tracks jobs and detects failures
• Model scales, but there is a high amount of communication during the intermediate (shuffle) phases
CCGrid '11 10
Memcached Architecture
• Distributed Caching Layer
– Allows spare memory from multiple nodes to be aggregated
– General purpose
• Typically used to cache database queries, results of API calls
• Scalable model, but typical usage very network intensive
11
[Figure: web-frontend servers reached over the Internet, connected through a system area network to memcached servers and, through another system area network, to database servers]
CCGrid '11
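For illustration, a minimal cache-aside lookup written against the C libmemcached client is sketched below; creating the client handle (memcached_create() plus memcached_server_add()) is omitted, and query_database() is a hypothetical placeholder for the slow database path, not part of any real API.

```c
#include <string.h>
#include <time.h>
#include <libmemcached/memcached.h>

/* Hypothetical slow path: fetch the value from the database. */
extern const char *query_database(const char *key);

/* Cache-aside sketch: check memcached before hitting the database. */
const char *lookup(memcached_st *memc, const char *key)
{
    size_t len;
    uint32_t flags;
    memcached_return_t rc;

    char *cached = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (cached != NULL)
        return cached;                        /* cache hit: no DB query needed */

    const char *value = query_database(key); /* cache miss: go to the database */
    memcached_set(memc, key, strlen(key), value, strlen(value),
                  (time_t)300 /* expire in 5 minutes */, 0);
    return value;
}
```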
• Good System Area Networks with excellent performance
(low latency, high bandwidth and low CPU utilization) for
inter-processor communication (IPC) and I/O
• Good Storage Area Networks with high-performance I/O
• Good WAN connectivity in addition to intra-cluster
SAN/LAN connectivity
• Quality of Service (QoS) for interactive applications
• RAS (Reliability, Availability, and Serviceability)
• With low cost
CCGrid '11
Networking and I/O Requirements
12
• Hardware components
– Processing cores and memory
subsystem
– I/O bus or links
– Network adapters/switches
• Software components
– Communication stack
• Bottlenecks can artificially
limit the network performance
the user perceives
CCGrid '11 13
Major Components in Computing Systems
[Figure: dual multi-core processors and memory, an I/O bus, and a network adapter and switch, with the potential processing, I/O-interface, and network bottlenecks marked]
• Ex: TCP/IP, UDP/IP
• Generic architecture for all networks
• Host processor handles almost all aspects
of communication
– Data buffering (copies on sender and
receiver)
– Data integrity (checksum)
– Routing aspects (IP routing)
• Signaling between different layers
– Hardware interrupt on packet arrival or
transmission
– Software signals between different layers
to handle protocol processing in different
priority levels
CCGrid '11
Processing Bottlenecks in Traditional Protocols
14
[Figure: the same system block diagram with the processing bottleneck (host CPUs and memory) highlighted]
• Traditionally relied on bus-based
technologies (last mile bottleneck)
– E.g., PCI, PCI-X
– One bit per wire
– Performance increase through:
• Increasing clock speed
• Increasing bus width
– Not scalable:
• Cross talk between bits
• Skew between wires
• Signal integrity makes it difficult to increase bus
width significantly, especially for high clock speeds
CCGrid '11
Bottlenecks in Traditional I/O Interfaces and Networks
15
PCI (1990): 33 MHz/32-bit, 1.05 Gbps (shared bidirectional)
PCI-X v1.0 (1998): 133 MHz/64-bit, 8.5 Gbps (shared bidirectional)
PCI-X v2.0 (2003): 266-533 MHz/64-bit, 17 Gbps (shared bidirectional)
[Figure: the same system block diagram with the I/O-interface (I/O bus) bottleneck highlighted]
• Network speeds saturated at
around 1Gbps
– Features provided were limited
– Commodity networks were not
considered scalable enough for very
large-scale systems
CCGrid '11 16
Bottlenecks on Traditional Networks
Ethernet (1979 - ): 10 Mbit/sec
Fast Ethernet (1993 - ): 100 Mbit/sec
Gigabit Ethernet (1995 - ): 1000 Mbit/sec
ATM (1995 - ): 155/622/1024 Mbit/sec
Myrinet (1993 - ): 1 Gbit/sec
Fibre Channel (1994 - ): 1 Gbit/sec
[Figure: the same system block diagram with the network (adapter and switch) bottleneck highlighted]
• Industry Networking Standards
• InfiniBand and High-speed Ethernet were introduced into
the market to address these bottlenecks
• InfiniBand aimed at all three bottlenecks (protocol
processing, I/O bus, and network speed)
• Ethernet aimed at directly handling the network speed
bottleneck and relying on complementary technologies to
alleviate the protocol processing and I/O bus bottlenecks
CCGrid '11 17
Motivation for InfiniBand and High-speed Ethernet
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
CCGrid '11
Presentation Overview
18
• IB Trade Association was formed with seven industry leaders
(Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: To design a scalable and high performance communication
and I/O architecture by taking an integrated view of computing,
networking, and storage technologies
• Many other industry players participated in the effort to define the IB
architecture specification
• IB Architecture (Volume 1, Version 1.0) was released to public
on Oct 24, 2000
– Latest version 1.2.1 released January 2008
• http://www.infinibandta.org
CCGrid '11
IB Trade Association
19
• 10GE Alliance formed by several industry leaders to take
the Ethernet family to the next speed step
• Goal: To achieve a scalable and high performance
communication architecture while maintaining backward
compatibility with Ethernet
• http://www.ethernetalliance.org
• 40-Gbps (Servers) and 100-Gbps Ethernet (Backbones,
Switches, Routers): IEEE 802.3 WG
• Energy-efficient and power-conscious protocols
– On-the-fly link speed reduction for under-utilized links
CCGrid '11
High-speed Ethernet Consortium (10GE/40GE/100GE)
20
• Network speed bottlenecks
• Protocol processing bottlenecks
• I/O interface bottlenecks
CCGrid '11 21
Tackling Communication Bottlenecks with IB and HSE
• Bit serial differential signaling
– Independent pairs of wires to transmit independent
data (called a lane)
– Scalable to any number of lanes
– Easy to increase clock speed of lanes (since each lane
consists only of a pair of wires)
• Theoretically, no perceived limit on the
bandwidth
CCGrid '11
Network Bottleneck Alleviation: InfiniBand (“Infinite
Bandwidth”) and High-speed Ethernet (10/40/100 GE)
22
CCGrid '11
Network Speed Acceleration with IB and HSE
Ethernet (1979 - ): 10 Mbit/sec
Fast Ethernet (1993 - ): 100 Mbit/sec
Gigabit Ethernet (1995 - ): 1000 Mbit/sec
ATM (1995 - ): 155/622/1024 Mbit/sec
Myrinet (1993 - ): 1 Gbit/sec
Fibre Channel (1994 - ): 1 Gbit/sec
InfiniBand (2001 - ): 2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001 - ): 10 Gbit/sec
InfiniBand (2003 - ): 8 Gbit/sec (4X SDR)
InfiniBand (2005 - ): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
InfiniBand (2007 - ): 32 Gbit/sec (4X QDR)
40-Gigabit Ethernet (2010 - ): 40 Gbit/sec
InfiniBand (2011 - ): 56 Gbit/sec (4X FDR)
InfiniBand (2012 - ): 100 Gbit/sec (4X EDR)
20 times in the last 9 years
23
InfiniBand Link Speed Standardization Roadmap
[Figure: per-direction link bandwidth roadmap, 2005-2015, covering DDR and QDR through FDR, EDR, HDR, and NDR for 1x, 4x, 8x, and 12x links]
Per-lane and rounded per-link bandwidth (Gb/s), per direction:
Lanes   DDR (4G/lane)   QDR (8G/lane)   FDR (14.025G/lane)   EDR (25.78125G/lane)
12      48+48           96+96           168+168              300+300
8       32+32           64+64           112+112              200+200
4       16+16           32+32           56+56                100+100
1       4+4             8+8             14+14                25+25
SDR = Single Data Rate (not shown), DDR = Double Data Rate, QDR = Quad Data Rate, FDR = Fourteen Data Rate, EDR = Enhanced Data Rate, HDR = High Data Rate, NDR = Next Data Rate
CCGrid '11 24
• Network speed bottlenecks
• Protocol processing bottlenecks
• I/O interface bottlenecks
CCGrid '11 25
Tackling Communication Bottlenecks with IB and HSE
• Intelligent Network Interface Cards
• Support entire protocol processing completely in hardware
(hardware protocol offload engines)
• Provide a rich communication interface to applications
– User-level communication capability
– Gets rid of intermediate data buffering requirements
• No software signaling between communication layers
– All layers are implemented on a dedicated hardware unit, and not
on a shared host CPU
CCGrid '11
Capabilities of High-Performance Networks
26
• Fast Messages (FM)
– Developed by UIUC
• Myricom GM
– Proprietary protocol stack from Myricom
• These network stacks set the trend for high-performance
communication requirements
– Hardware offloaded protocol stack
– Support for fast and secure user-level access to the protocol stack
• Virtual Interface Architecture (VIA)
– Standardized by Intel, Compaq, Microsoft
– Precursor to IB
CCGrid '11
Previous High-Performance Network Stacks
27
• Some IB models have multiple hardware accelerators
– E.g., Mellanox IB adapters
• Protocol Offload Engines
– Completely implement ISO/OSI layers 2-4 (link layer, network layer
and transport layer) in hardware
• Additional hardware supported features also present
– RDMA, Multicast, QoS, Fault Tolerance, and many more
CCGrid '11
IB Hardware Acceleration
28
• Interrupt Coalescing
– Improves throughput, but degrades latency
• Jumbo Frames
– No latency impact, but incompatible with existing switches
• Hardware Checksum Engines
– Checksum performed in hardware is significantly faster
– Shown to have minimal benefit on its own
• Segmentation Offload Engines (a.k.a. Virtual MTU)
– Host processor "thinks" that the adapter supports large Jumbo frames, but the adapter splits them into regular-sized (1500-byte) frames
– Supported by most HSE products because of its backward compatibility; considered "regular" Ethernet
– Heavily used in the "server-on-steroids" model
• High-performance servers connected to regular clients
CCGrid '11
Ethernet Hardware Acceleration
29
• TCP Offload Engines (TOE)
– Hardware Acceleration for the entire TCP/IP stack
– Initially patented by Tehuti Networks
– Actually refers to the IC on the network adapter that implements
TCP/IP
– In practice, usually referred to as the entire network adapter
• Internet Wide-Area RDMA Protocol (iWARP)
– Standardized by IETF and the RDMA Consortium
– Support acceleration features (like IB) for Ethernet
• http://www.ietf.org & http://www.rdmaconsortium.org
CCGrid '11
TOE and iWARP Accelerators
30
• Also known as “Datacenter Ethernet” or “Lossless Ethernet”
– Combines a number of optional Ethernet standards into one umbrella
as mandatory requirements
• Sample enhancements include:
– Priority-based flow-control: Link-level flow control for each Class of
Service (CoS)
– Enhanced Transmission Selection (ETS): Bandwidth assignment to
each CoS
– Data Center Bridging eXchange protocol (DCBX): congestion
notification, priority classes
– End-to-end Congestion notification: Per flow congestion control to
supplement per link flow control
CCGrid '11
Converged (Enhanced) Ethernet (CEE or CE)
31
• Network speed bottlenecks
• Protocol processing bottlenecks
• I/O interface bottlenecks
CCGrid '11 32
Tackling Communication Bottlenecks with IB and HSE
• InfiniBand initially intended to replace I/O bus
technologies with networking-like technology
– That is, bit serial differential signaling
– With enhancements in I/O technologies that use a similar
architecture (HyperTransport, PCI Express), this has become
mostly irrelevant now
• Both IB and HSE today come as network adapters that plug
into existing I/O technologies
CCGrid '11
Interplay with I/O Technologies
33
• Recent I/O interfaces nearly match network speeds head-to-head
(though they still lag slightly)
CCGrid '11
Trends in I/O Interfaces with Servers
PCI (1990): 33 MHz/32-bit, 1.05 Gbps (shared bidirectional)
PCI-X, 1998 (v1.0) and 2003 (v2.0): 133 MHz/64-bit, 8.5 Gbps; 266-533 MHz/64-bit, 17 Gbps (shared bidirectional)
AMD HyperTransport (HT), 2001 (v1.0), 2004 (v2.0), 2006 (v3.0), 2008 (v3.1): 102.4 Gbps (v1.0), 179.2 Gbps (v2.0), 332.8 Gbps (v3.0), 409.6 Gbps (v3.1), with 32 lanes
PCI-Express (PCIe) by Intel, 2003 (Gen1), 2007 (Gen2), 2009 (Gen3 standard): Gen1: 4X (8 Gbps), 8X (16 Gbps), 16X (32 Gbps); Gen2: 4X (16 Gbps), 8X (32 Gbps), 16X (64 Gbps); Gen3: 4X (~32 Gbps), 8X (~64 Gbps), 16X (~128 Gbps)
Intel QuickPath Interconnect (QPI), 2009: 153.6-204.8 Gbps (20 lanes)
34
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
CCGrid '11
Presentation Overview
35
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
– Novel Features
– Subnet Management and Services
• High-speed Ethernet Family
– Internet Wide Area RDMA Protocol (iWARP)
– Alternate vendor-specific protocol stacks
• InfiniBand/Ethernet Convergence Technologies
– Virtual Protocol Interconnect (VPI)
– (InfiniBand) RDMA over Ethernet (RoE)
– (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)
CCGrid '11
IB, HSE and their Convergence
36
CCGrid '11 37
Comparing InfiniBand with Traditional Networking Stack
Layer-by-layer comparison:
Application layer: InfiniBand: MPI, PGAS, file systems | Traditional Ethernet: HTTP, FTP, MPI, file systems (sockets interface)
Transport layer: InfiniBand: OpenFabrics verbs, RC (reliable) and UD (unreliable) services | Traditional Ethernet: TCP, UDP
Network layer: InfiniBand: routing | Traditional Ethernet: routing
Link layer: InfiniBand: flow control, error detection | Traditional Ethernet: flow control and error detection
Physical layer: InfiniBand: copper or optical | Traditional Ethernet: copper, optical, or wireless
Management: InfiniBand: OpenSM (management tool) | Traditional Ethernet: DNS management tools
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Subnet Management and Services
CCGrid '11
IB Overview
38
• Used by processing and I/O units to
connect to fabric
• Consume & generate IB packets
• Programmable DMA engines with
protection features
• May have multiple ports
– Independent buffering channeled
through Virtual Lanes
• Host Channel Adapters (HCAs)
CCGrid '11
Components: Channel Adapters
39
[Figure: channel adapter internals: multiple ports, each with virtual lanes (VLs), a DMA engine, queue pairs (QPs), local memory, the transport engine, and a subnet management agent (SMA)]
• Relay packets from a link to another
• Switches: intra-subnet
• Routers: inter-subnet
• May support multicast
CCGrid '11
Components: Switches and Routers
40
[Figure: a switch relays packets between ports within a subnet; a router relays packets across subnets using the Global Routing Header (GRH); both have multiple ports with virtual lanes]
• Network Links
– Copper, Optical, Printed Circuit wiring on Back Plane
– Not directly addressable
• Traditional adapters built for copper cabling
– Restricted by cable length (signal integrity)
– For example, QDR copper cables are restricted to 7m
• Intel Connects: Optical cables with Copper-to-optical
conversion hubs (acquired by Emcore)
– Up to 100m length
– 550 picoseconds
copper-to-optical conversion latency
• Available from other vendors (Luxtera)
• Repeaters (Vol. 2 of InfiniBand specification)
CCGrid '11
Components: Links & Repeaters
(Courtesy Intel)
41
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Subnet Management and Services
CCGrid '11
IB Overview
42
CCGrid '11
IB Communication Model
[Figure: basic InfiniBand communication semantics]
43
• Each QP has two queues
– Send Queue (SQ)
– Receive Queue (RQ)
– Work requests are queued to the QP
(WQEs: “Wookies”)
• Each QP is linked to a Completion Queue (CQ)
– Gives notification of operation
completion from QPs
– Completed WQEs are placed in the
CQ with additional information
(CQEs: “Cookies”)
CCGrid '11
Queue Pair Model
[Figure: an InfiniBand device with a QP (send and receive queues holding WQEs) linked to a CQ holding CQEs]
44
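To make the queue-pair model concrete, the sketch below creates a CQ and a reliable-connection QP with the OpenFabrics verbs API. It assumes an already opened device context and protection domain, omits all error handling and the later QP state transitions (INIT, RTR, RTS), and the queue depths are arbitrary illustrative values.

```c
#include <infiniband/verbs.h>

/* Minimal sketch: create a completion queue and a reliable-connection QP.
 * Assumes 'ctx' (device context) and 'pd' (protection domain) already exist. */
struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                            struct ibv_cq **cq_out)
{
    /* CQ with room for 128 completions (CQEs) */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,              /* send completions are reported here    */
        .recv_cq = cq,              /* receive completions are reported here */
        .cap     = {
            .max_send_wr  = 64,     /* depth of the Send Queue (SQ)    */
            .max_recv_wr  = 64,     /* depth of the Receive Queue (RQ) */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,      /* Reliable Connection transport */
    };

    *cq_out = cq;
    return ibv_create_qp(pd, &attr);
}
```

Completed WQEs are later harvested from the CQ with ibv_poll_cq(), which fills in work-completion (CQE) structures.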
1. Registration Request
• Send virtual address and length
2. Kernel handles virtual->physical
mapping and pins region into
physical memory
• Process cannot map memory
that it does not own (security !)
3. HCA caches the virtual to physical
mapping and issues a handle
• Includes an l_key and r_key
4. Handle is returned to application
CCGrid '11
Memory Registration
Before we do any communication:
All memory used for communication must
be registered
[Figure: registration flow between the process, the kernel, and the HCA/RNIC, following steps 1-4 above]
45
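A minimal sketch of this registration step with the verbs API is shown below; it assumes a protection domain already exists, and the access-flag combination used is just one common choice, not a requirement.

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Register a buffer so the HCA can DMA to and from it.
 * The returned ibv_mr carries the l_key (for local access checks) and the
 * r_key (handed to peers that will issue RDMA operations on this buffer). */
struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);                 /* virtual address to register */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    /* mr->lkey and mr->rkey are the handles described on this slide */
    return mr;
}
```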
• To send or receive data the l_key
must be provided to the HCA
• HCA verifies access to local
memory
• For RDMA, initiator must have the
r_key for the remote virtual address
• Possibly exchanged with a
send/recv
• r_key is not encrypted in IB
CCGrid '11
Memory Protection
[Figure: process, kernel, and HCA/NIC, with the l_key used by the HCA for local access checks]
The r_key is needed for RDMA operations. For security, keys are required for all operations that touch buffers.
46
CCGrid '11
Communication in the Channel Semantics
(Send/Receive Model)
[Figure: sender and receiver nodes, each with a processor, memory, and an InfiniBand device holding a QP (send/receive queues) and a CQ]
The send WQE contains information about the send buffer (multiple non-contiguous segments). The receive WQE contains information on the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place the data. The acknowledgment is generated in hardware.
The processor is involved only to:
1. Post the receive WQE
2. Post the send WQE
3. Pull completed CQEs from the CQ
47
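A sketch of the two-sided exchange in verbs follows; it assumes the QP is already connected and that buf, len, and lkey come from the registration step above (single-segment buffers and no error handling, for brevity).

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Receiver: pre-post a receive WQE describing where incoming data may land. */
void post_receive(struct ibv_qp *qp, void *buf, uint32_t len, uint32_t lkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    ibv_post_recv(qp, &wr, &bad);
}

/* Sender: post a send WQE describing the source buffer. */
void post_send(struct ibv_qp *qp, void *buf, uint32_t len, uint32_t lkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
    struct ibv_send_wr wr = {
        .wr_id      = 2,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,          /* channel (two-sided) semantics */
        .send_flags = IBV_SEND_SIGNALED,    /* generate a CQE on completion  */
    };
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);
}
```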
CCGrid '11
Communication in the Memory Semantics (RDMA Model)
[Figure: initiator and target nodes, each with a processor, memory, and an InfiniBand device holding a QP and a CQ]
The send WQE contains information about the send buffer (multiple segments) and the receive buffer (a single segment). The acknowledgment is generated in hardware.
The initiator processor is involved only to:
1. Post the send WQE
2. Pull the completed CQE from the send CQ
There is no involvement from the target processor.
48
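A corresponding one-sided sketch is shown below: an RDMA Write posted by the initiator, assuming the target's virtual address and r_key were exchanged out of band (for example via an earlier send/receive).

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* One-sided RDMA Write: data moves from the local buffer into remote memory
 * without involving the target CPU. remote_addr and rkey were obtained from
 * the target beforehand. */
void rdma_write(struct ibv_qp *qp, void *buf, uint32_t len, uint32_t lkey,
                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,    /* IBV_WR_RDMA_READ pulls data instead */
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;   /* target virtual address */
    wr.wr.rdma.rkey        = rkey;          /* target's r_key          */

    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);           /* only the initiator posts work */
}
```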
CCGrid '11
Communication in the Memory Semantics (Atomics)
[Figure: initiator and target nodes, each with a processor, memory, and an InfiniBand device holding a QP and a CQ; the operation is applied to a 64-bit destination memory segment and the original value is returned to the source memory segment]
The send WQE contains information about the send buffer (a single 64-bit segment) and the receive buffer (a single 64-bit segment). IB supports compare-and-swap and fetch-and-add atomic operations.
The initiator processor is involved only to:
1. Post the send WQE
2. Pull the completed CQE from the send CQ
There is no involvement from the target processor.
49
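As an illustration, a fetch-and-add on a remote 64-bit word can be posted as sketched below; the local result buffer must be registered and 8-byte aligned, and connection setup and error handling are omitted.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Atomic fetch-and-add on a remote 64-bit word. The pre-operation value of
 * the remote location is written back into local_result. */
void remote_fetch_add(struct ibv_qp *qp, uint64_t *local_result, uint32_t lkey,
                      uint64_t remote_addr, uint32_t rkey, uint64_t add_value)
{
    struct ibv_sge sge = { .addr = (uintptr_t)local_result,
                           .length = sizeof(uint64_t), .lkey = lkey };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.atomic.remote_addr = remote_addr;  /* must be 8-byte aligned */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = add_value;    /* value to add            */
    /* For IBV_WR_ATOMIC_CMP_AND_SWP, wr.wr.atomic.swap holds the new value */

    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);
}
```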
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Subnet Management and Services
CCGrid '11
IB Overview
50
CCGrid '11
Hardware Protocol Offload
Complete
Hardware
Implementations
Exist
51
• Buffering and Flow Control
• Virtual Lanes, Service Levels and QoS
• Switching and Multicast
CCGrid '11
Link/Network Layer Capabilities
52
• IB provides three-levels of communication throttling/control
mechanisms
– Link-level flow control (link layer feature)
– Message-level flow control (transport layer feature): discussed later
– Congestion control (part of the link layer features)
• IB provides an absolute credit-based flow-control
– Receiver guarantees that enough space is allotted for N blocks of data
– Occasional update of available credits by the receiver
• Has no relation to the number of messages, but only to the
total amount of data being sent
– One 1MB message is equivalent to 1024 1KB messages (except for
rounding off at message boundaries)
CCGrid '11
Buffering and Flow Control
53
• Multiple virtual links within
same physical link
– Between 2 and 16
• Separate buffers and flow
control
– Avoids Head-of-Line
Blocking
• VL15: reserved for
management
• Each port supports one or more data VLs
CCGrid '11
Virtual Lanes
54
• Service Level (SL):
– Packets may operate at one of 16 different SLs
– Meaning not defined by IB
• SL to VL mapping:
– SL determines which VL on the next link is to be used
– Each port (switches, routers, end nodes) has a SL to VL mapping
table configured by the subnet management
• Partitions:
– Fabric administration (through Subnet Manager) may assign
specific SLs to different partitions to isolate traffic flows
CCGrid '11
Service Levels and QoS
55
• InfiniBand Virtual Lanes
allow the multiplexing of
multiple independent logical
traffic flows on the same
physical link
• Providing the benefits of
independent, separate
networks while eliminating
the cost and difficulties
associated with maintaining
two or more networks
CCGrid '11
Traffic Segregation Benefits
(Courtesy: Mellanox Technologies)
[Figure: a single InfiniBand fabric carrying IP-network traffic (routers, switches, VPNs, DSLAMs), storage-area-network traffic (RAID, NAS, backup), and IPC/load-balancing/web-cache/ASP traffic on separate virtual lanes]
56
• Each port has one or more associated LIDs (Local
Identifiers)
– Switches look up which port to forward a packet to based on its
destination LID (DLID)
– This information is maintained at the switch
• For multicast packets, the switch needs to maintain
multiple output ports to forward the packet to
– Packet is replicated to each appropriate output port
– Ensures at-most once delivery & loop-free forwarding
– There is an interface for a group management protocol
• Create, join/leave, prune, delete group
CCGrid '11
Switching (Layer-2 Routing) and Multicast
57
• Basic unit of switching is a crossbar
– Current InfiniBand products use either 24-port (DDR) or 36-port
(QDR) crossbars
• Switches available in the market are typically collections of
crossbars within a single cabinet
• Do not confuse “non-blocking switches” with “crossbars”
– Crossbars provide all-to-all connectivity to all connected nodes
• For any random node pair selection, all communication is non-blocking
– Non-blocking switches provide a fat-tree of many crossbars
• For any random node pair selection, there exists a switch
configuration such that communication is non-blocking
• If the communication pattern changes, the same switch configuration
might no longer provide fully non-blocking communication
CCGrid '11 58
Switch Complex
• Someone has to setup the forwarding tables and
give every port an LID
– “Subnet Manager” does this work
• Different routing algorithms give different paths
CCGrid '11
IB Switching/Routing: An Example
[Figure: an example IB switch block diagram (Mellanox 144-port) built from leaf and spine crossbar blocks; each leaf block holds a forwarding table mapping destination LIDs (e.g., LID 2, LID 4) to output ports]
59
• Switching: IB supports Virtual Cut-Through (VCT)
• Routing: unspecified by the IB spec; Up*/Down* and Shift are popular routing engines supported by OFED
• Fat-Tree is a popular
topology for IB Cluster
– Different over-subscription
ratio may be used
• Other topologies are also
being used
– 3D Torus (Sandia Red Sky)
and SGI Altix (Hypercube)
• Similar to basic switching, except…
– … sender can utilize multiple LIDs associated to the same
destination port
• Packets sent to one DLID take a fixed path
• Different packets can be sent using different DLIDs
• Each DLID can have a different path (switch can be configured
differently for each DLID)
• Can cause out-of-order arrival of packets
– IB uses a simplistic approach:
• If packets in one connection arrive out-of-order, they are dropped
– Easier to use different DLIDs for different connections
• This is what most high-level libraries using IB do!
CCGrid '11 60
More on Multipathing
CCGrid '11
IB Multicast Example
61
CCGrid '11
Hardware Protocol Offload
Complete
Hardware
Implementations
Exist
62
• Each transport service can have zero or more QPs
associated with it
– E.g., you can have four QPs based on RC and one QP based on UD
CCGrid '11
IB Transport Services
Service Type            Connection-Oriented   Acknowledged   Transport
Reliable Connection     Yes                   Yes            IBA
Unreliable Connection   Yes                   No             IBA
Reliable Datagram       No                    Yes            IBA
Unreliable Datagram     No                    No             IBA
Raw Datagram            No                    No             Raw
63
CCGrid '11
Trade-offs in Different Transport Types
64
Transport types compared: Reliable Connection (RC), Reliable Datagram (RD), eXtended Reliable Connection (XRC), Unreliable Connection (UC), Unreliable Datagram (UD), Raw Datagram
Scalability (M processes per node, N nodes): RC: M^2*N QPs per HCA; RD: M QPs per HCA; XRC: M*N QPs per HCA; UC: M^2*N QPs per HCA; UD: M QPs per HCA; Raw: 1 QP per HCA
Corrupt data detected: yes for all types
Data delivery guarantee: RC, RD, XRC: data delivered exactly once; UC, UD, Raw: no guarantees
Data order guarantees: RC, XRC: per connection; RD: one source to multiple destinations; UC: unordered, duplicate data detected; UD, Raw: no
Data loss detected: RC, RD, XRC, UC: yes; UD, Raw: no
Error recovery: RC, RD, XRC: errors (retransmissions, alternate path, etc.) handled by the transport layer, with the client only involved in handling fatal errors (broken links, protection violation, etc.); UC: packets with errors and sequence errors are reported to the responder; UD, Raw: none
• Data Segmentation
• Transaction Ordering
• Message-level Flow Control
• Static Rate Control and Auto-negotiation
CCGrid '11
Transport Layer Capabilities
65
• IB transport layer provides a message-level communication
granularity, not byte-level (unlike TCP)
• Application can hand over a large message
– Network adapter segments it to MTU sized packets
– Single notification when the entire message is transmitted or
received (not per packet)
• Reduced host overhead to send/receive messages
– Depends on the number of messages, not the number of bytes
CCGrid '11 66
Data Segmentation
• IB follows a strong transaction ordering for RC
• Sender network adapter transmits messages in the order
in which WQEs were posted
• Each QP utilizes a single LID
– All WQEs posted on same QP take the same path
– All packets are received by the receiver in the same order
– All receive WQEs are completed in the order in which they were
posted
CCGrid '11
Transaction Ordering
67
• Also called End-to-end Flow-control
– Does not depend on the number of network hops
• Separate from Link-level Flow-Control
– Link-level flow-control only relies on the number of bytes being
transmitted, not the number of messages
– Message-level flow-control only relies on the number of messages
transferred, not the number of bytes
• If 5 receive WQEs are posted, the sender can send 5
messages (can post 5 send WQEs)
– If the sent messages are larger than the posted receive buffers,
flow control cannot handle it
CCGrid '11 68
Message-level Flow-Control
• IB allows link rates to be statically changed
– On a 4X link, we can set data to be sent at 1X
– For heterogeneous links, rate can be set to the lowest link rate
– Useful for low-priority traffic
• Auto-negotiation also available
– E.g., if you connect a 4X adapter to a 1X switch, data is
automatically sent at 1X rate
• Only fixed settings available
– Cannot set rate requirement to 3.16 Gbps, for example
CCGrid '11 69
Static Rate Control and Auto-Negotiation
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Subnet Management and Services
CCGrid '11
IB Overview
70
• Agents
– Processes or hardware units running on each adapter, switch,
router (everything on the network)
– Provide capability to query and set parameters
• Managers
– Make high-level decisions and implement it on the network fabric
using the agents
• Messaging schemes
– Used for interactions between the manager and agents (or
between agents)
• Messages
CCGrid '11
Concepts in IB Management
71
CCGrid '11
Subnet Manager
[Figure: a subnet manager discovering the fabric, activating inactive links between compute nodes and switches, and handling multicast join requests by setting up multicast routes]
72
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
– Novel Features
– Subnet Management and Services
• High-speed Ethernet Family
– Internet Wide Area RDMA Protocol (iWARP)
– Alternate vendor-specific protocol stacks
• InfiniBand/Ethernet Convergence Technologies
– Virtual Protocol Interconnect (VPI)
– (InfiniBand) RDMA over Ethernet (RoE)
– (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)
CCGrid '11
IB, HSE and their Convergence
73
• High-speed Ethernet Family
– Internet Wide-Area RDMA Protocol (iWARP)
• Architecture and Components
• Features
– Out-of-order data placement
– Dynamic and Fine-grained Data Rate control
– Multipathing using VLANs
• Existing Implementations of HSE/iWARP
– Alternate Vendor-specific Stacks
• MX over Ethernet (for Myricom 10GE adapters)
• Datagram Bypass Layer (for Myricom 10GE adapters)
• Solarflare OpenOnload (for Solarflare 10GE adapters)
CCGrid '11
HSE Overview
74
CCGrid '11
IB and HSE RDMA Models: Commonalities and
Differences
Feature                 IB                          iWARP/HSE
Hardware Acceleration   Supported                   Supported
RDMA                    Supported                   Supported
Atomic Operations       Supported                   Not supported
Multicast               Supported                   Supported
Congestion Control      Supported                   Supported
Data Placement          Ordered                     Out-of-order
Data Rate-control       Static and coarse-grained   Dynamic and fine-grained
QoS                     Prioritization              Prioritization and fixed-bandwidth QoS
Multipathing            Using DLIDs                 Using VLANs
75
• RDMA Protocol (RDMAP)
– Feature-rich interface
– Security Management
• Remote Direct Data Placement (RDDP)
– Data Placement and Delivery
– Multi Stream Semantics
– Connection Management
• Marker PDU Aligned (MPA)
– Middle Box Fragmentation
– Data Integrity (CRC)
CCGrid '11
iWARP Architecture and Components
[Figure: iWARP stack: an application or library in user space issues work to RDMAP, which runs over RDDP and MPA, which in turn run over TCP (or SCTP) and IP; the iWARP offload engines and TCP/IP are implemented on the network adapter (e.g., 10GigE), below the device driver (courtesy iWARP specification)]
76
• High-speed Ethernet Family
– Internet Wide-Area RDMA Protocol (iWARP)
• Architecture and Components
• Features
– Out-of-order data placement
– Dynamic and Fine-grained Data Rate control
• Existing Implementations of HSE/iWARP
– Alternate Vendor-specific Stacks
• MX over Ethernet (for Myricom 10GE adapters)
• Datagram Bypass Layer (for Myricom 10GE adapters)
• Solarflare OpenOnload (for Solarflare 10GE adapters)
CCGrid '11
HSE Overview
77
• Place data as it arrives, whether in or out-of-order
• If data is out-of-order, place it at the appropriate offset
• Issues from the application’s perspective:
– The second half of a message being placed does not mean that the first half has arrived as well
– One message being placed does not mean that previous messages have been placed
• Issues from protocol stack’s perspective
– The receiver network stack has to understand each frame of data
• If the frame is unchanged during transmission, this is easy!
– The MPA protocol layer adds appropriate information at regular
intervals to allow the receiver to identify fragmented frames
CCGrid '11
Decoupled Data Placement and Data Delivery
78
• Part of the Ethernet standard, not iWARP
– Network vendors use a separate interface to support it
• Dynamic bandwidth allocation to flows based on interval
between two packets in a flow
– E.g., one stall for every packet sent on a 10 Gbps network refers to
a bandwidth allocation of 5 Gbps
– Complicated because of TCP windowing behavior
• Important for high-latency/high-bandwidth networks
– Large windows exposed on the receiver side
– Receiver overflow controlled through rate control
CCGrid '11
Dynamic and Fine-grained Rate Control
79
• Can allow for simple prioritization:
– E.g., connection 1 performs better than connection 2
– 8 classes provided (a connection can be in any class)
• Similar to SLs in InfiniBand
– Two priority classes for high-priority traffic
• E.g., management traffic or your favorite application
• Or can allow for specific bandwidth requests:
– E.g., can request for 3.62 Gbps bandwidth
– Packet pacing and stalls used to achieve this
• Query functionality to find out “remaining bandwidth”
CCGrid '11
Prioritization and Fixed Bandwidth QoS
80
• High-speed Ethernet Family
– Internet Wide-Area RDMA Protocol (iWARP)
• Architecture and Components
• Features
– Out-of-order data placement
– Dynamic and Fine-grained Data Rate control
• Existing Implementations of HSE/iWARP
– Alternate Vendor-specific Stacks
• MX over Ethernet (for Myricom 10GE adapters)
• Datagram Bypass Layer (for Myricom 10GE adapters)
• Solarflare OpenOnload (for Solarflare 10GE adapters)
CCGrid '11
HSE Overview
81
• Regular Ethernet adapters and
TOEs are fully compatible
• Compatibility with iWARP
required
• Software iWARP emulates the
functionality of iWARP on the
host
– Fully compatible with hardware
iWARP
– Internally utilizes host TCP/IP stack
CCGrid '11
Software iWARP based Compatibility
82
[Figure: a distributed cluster environment in which a regular Ethernet cluster, a TOE cluster, and an iWARP cluster interoperate over a wide-area network]
CCGrid '11
Different iWARP Implementations
[Figure: iWARP implementation options: software iWARP (kernel-level or user-level) running over regular Ethernet adapters or over TCP offload engines, versus fully offloaded iWARP, TCP (modified with MPA), and IP on iWARP-compliant adapters; implementations attributed in the figure to OSU, OSC, IBM; OSU, ANL; and Chelsio, NetEffect (Intel)]
83
• High-speed Ethernet Family
– Internet Wide-Area RDMA Protocol (iWARP)
• Architecture and Components
• Features
– Out-of-order data placement
– Dynamic and Fine-grained Data Rate control
– Multipathing using VLANs
• Existing Implementations of HSE/iWARP
– Alternate Vendor-specific Stacks
• MX over Ethernet (for Myricom 10GE adapters)
• Datagram Bypass Layer (for Myricom 10GE adapters)
• Solarflare OpenOnload (for Solarflare 10GE adapters)
CCGrid '11
HSE Overview
84
• Proprietary communication layer developed by Myricom for
their Myrinet adapters
– Third generation communication layer (after FM and GM)
– Supports Myrinet-2000 and the newer Myri-10G adapters
• Low-level “MPI-like” messaging layer
– Almost one-to-one match with MPI semantics (including connection-
less model, implicit memory registration and tag matching)
– Later versions added some more advanced communication methods
such as RDMA to support other programming models such as ARMCI
(low-level runtime for the Global Arrays PGAS library)
• Open-MX
– New open-source implementation of the MX interface for non-
Myrinet adapters from INRIA, France
CCGrid '11 85
Myrinet Express (MX)
• Another proprietary communication layer developed by
Myricom
– Compatible with regular UDP sockets (embraces and extends)
– Idea is to bypass the kernel stack and give UDP applications direct
access to the network adapter
• High performance and low-jitter
• Primary motivation: Financial market applications (e.g.,
stock market)
– Applications prefer unreliable communication
– Timeliness is more important than reliability
• This stack is covered by NDA; more details can be
requested from Myricom
CCGrid '11 86
Datagram Bypass Layer (DBL)
CCGrid '11
Solarflare Communications: OpenOnload Stack
87
[Figure: typical commodity networking stack vs. typical HPC networking stack]
• HPC Networking Stack provides many
performance benefits, but has limitations
for certain types of scenarios, especially
where applications tend to fork(), exec()
and need asynchronous advancement (per
application)
• Solarflare approach to the networking stack:
– Network hardware provides a user-safe interface to route packets directly to apps based on flow information in headers
– Protocol processing can happen in both kernel and user space
– Protocol state is shared between app and kernel using shared memory
Courtesy Solarflare communications (www.openonload.org/openonload-google-talk.pdf)
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
– Novel Features
– Subnet Management and Services
• High-speed Ethernet Family
– Internet Wide Area RDMA Protocol (iWARP)
– Alternate vendor-specific protocol stacks
• InfiniBand/Ethernet Convergence Technologies
– Virtual Protocol Interconnect (VPI)
– (InfiniBand) RDMA over Ethernet (RoE)
– (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)
CCGrid '11
IB, HSE and their Convergence
88
• Single network firmware to support
both IB and Ethernet
• Autosensing of layer-2 protocol
– Can be configured to automatically
work with either IB or Ethernet
networks
• Multi-port adapters can use one
port on IB and another on Ethernet
• Multiple use modes:
– Datacenters with IB inside the
cluster and Ethernet outside
– Clusters with IB network and
Ethernet management
CCGrid '11
Virtual Protocol Interconnect (VPI)
[Figure: VPI adapter: applications use IB verbs over the IB transport, network, and link layers on the IB port, or sockets over TCP/IP (with hardware TCP/IP support) over the Ethernet link layer on the Ethernet port]
89
• Native convergence of IB network and transport
layers with Ethernet link layer
• IB packets encapsulated in Ethernet frames
• IB network layer already uses IPv6 frames
• Pros:
– Works natively in Ethernet environments (entire
Ethernet management ecosystem is available)
– Has all the benefits of IB verbs
• Cons:
– Network bandwidth might be limited to Ethernet
switches: 10GE switches available; 40GE yet to
arrive; 32 Gbps IB available
– Some IB native link-layer features are optional in
(regular) Ethernet
• Approved by OFA board to be included into OFED
CCGrid '11
(InfiniBand) RDMA over Ethernet (IBoE or RoE)
[Figure: RoE stack: application, IB verbs, IB transport, and IB network layers in hardware over an Ethernet link layer]
90
• Very similar to IB over Ethernet
– Often used interchangeably with IBoE
– Can be used to explicitly specify link layer is
Converged (Enhanced) Ethernet (CE)
• Pros:
– Works natively in Ethernet environments (entire
Ethernet management ecosystem is available)
– Has all the benefits of IB verbs
– CE is very similar to the link layer of native IB, so
there are no missing features
• Cons:
– Network bandwidth might be limited to Ethernet
switches: 10GE switches available; 40GE yet to
arrive; 32 Gbps IB available
CCGrid '11
(InfiniBand) RDMA over Converged (Enhanced) Ethernet
(RoCE)
[Figure: RoCE stack: application, IB verbs, IB transport, and IB network layers in hardware over a Converged (Enhanced) Ethernet link layer]
91
CCGrid '11
IB and HSE: Feature Comparison
Feature                  IB                iWARP/HSE      RoE               RoCE
Hardware Acceleration    Yes               Yes            Yes               Yes
RDMA                     Yes               Yes            Yes               Yes
Congestion Control       Yes               Optional       Optional          Yes
Multipathing             Yes               Yes            Yes               Yes
Atomic Operations        Yes               No             Yes               Yes
Multicast                Optional          No             Optional          Optional
Data Placement           Ordered           Out-of-order   Ordered           Ordered
Prioritization           Optional          Optional       Optional          Yes
Fixed BW QoS (ETS)       No                Optional       Optional          Yes
Ethernet Compatibility   No                Yes            Yes               Yes
TCP/IP Compatibility     Yes (via IPoIB)   Yes            Yes (via IPoIB)   Yes (via IPoIB)
92
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
CCGrid '11
Presentation Overview
93
• Many IB vendors: Mellanox+Voltaire and Qlogic
– Aligned with many server vendors: Intel, IBM, SUN, Dell
– And many integrators: Appro, Advanced Clustering, Microway
• Broadly two kinds of adapters
– Offloading (Mellanox) and Onloading (Qlogic)
• Adapters with different interfaces:
– Dual port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT
• MemFree Adapter
– No memory on the HCA; uses system memory (through PCIe)
– Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)
• Different speeds
– SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps)
• Some 12X SDR adapters exist as well (24 Gbps each way)
• New ConnectX-2 adapter from Mellanox supports offload for
collectives (Barrier, Broadcast, etc.)
CCGrid '11
IB Hardware Products
94
CCGrid '11
Tyan Thunder S2935 Board
(Courtesy Tyan)
Similar boards from Supermicro with LOM features are also available
95
• Customized adapters to work with IB switches
– Cray XD1 (formerly by Octigabay), Cray CX1
• Switches:
– 4X SDR and DDR (8-288 ports); 12X SDR (small sizes)
– 3456-port "Magnum" switch from SUN, used at TACC
• 72-port "nano magnum"
– 36-port Mellanox InfiniScale IV QDR switch silicon in 2008
• Up to 648-port QDR switch by Mellanox and SUN
• Some internal ports are 96 Gbps (12X QDR)
– IB switch silicon from Qlogic introduced at SC ’08
• Up to 846-port QDR switch by Qlogic
– New FDR (56 Gbps) switch silicon (Bridge-X) has been announced by
Mellanox in May ‘11
• Switch Routers with Gateways
– IB-to-FC; IB-to-IP
CCGrid '11
IB Hardware Products (contd.)
96
• 10GE adapters: Intel, Myricom, Mellanox (ConnectX)
• 10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel)
• 40GE adapters: Mellanox ConnectX2-EN 40G
• 10GE switches
– Fulcrum Microsystems
• Low latency switch based on 24-port silicon
• FM4000 switch with IP routing, and TCP/UDP support
– Fujitsu, Myricom (512 ports), Force10, Cisco, Arista (formerly Arastra)
• 40GE and 100GE switches
– Nortel Networks
• 10GE downlinks with 40GE and 100GE uplinks
– Broadcom has announced 40GE switch in early 2010
CCGrid '11
10G, 40G and 100G Ethernet Products
97
• Mellanox ConnectX Adapter
• Supports IB and HSE convergence
• Ports can be configured to support IB or HSE
• Support for VPI and RoCE
– 8 Gbps (SDR), 16Gbps (DDR) and 32Gbps (QDR) rates available
for IB
– 10GE rate available for RoCE
– A 40GE rate for RoCE is expected to be available in the near future
CCGrid '11
Products Providing IB and HSE Convergence
98
• Open source organization (formerly OpenIB)
– www.openfabrics.org
• Incorporates both IB and iWARP in a unified manner
– Support for Linux and Windows
– Design of complete stack with `best of breed’ components
• Gen1
• Gen2 (current focus)
• Users can download the entire stack and run
– Latest release is OFED 1.5.3
– OFED 1.6 is underway
CCGrid '11
Software Convergence with OpenFabrics
99
CCGrid '11
OpenFabrics Stack with Unified Verbs Interface
[Figure: a unified verbs interface (libibverbs) at user level over vendor libraries (Mellanox libmthca, QLogic libipathverbs, IBM libehca, Chelsio libcxgb3), with matching kernel drivers (ib_mthca, ib_ipath, ib_ehca, ib_cxgb3) driving the respective Mellanox, QLogic, IBM, and Chelsio adapters]
100
• For IBoE and RoCE, the upper-level
stacks remain completely
unchanged
• Within the hardware:
– Transport and network layers remain
completely unchanged
– Both IB and Ethernet (or CEE) link
layers are supported on the network
adapter
• Note: The OpenFabrics stack is not
valid for the Ethernet path in VPI
– That still uses sockets and TCP/IP
CCGrid '11
OpenFabrics on Convergent IB/HSE
[Figure: the verbs interface (libibverbs) over the ConnectX user library (libmlx4) and kernel driver (ib_mlx4), driving ConnectX adapters on either HSE or IB ports]
101
CCGrid '11
OpenFabrics Software Stack
[Figure: the OpenFabrics software stack: applications and access methods (various MPIs, sockets-based access, clustered DB access, access to file systems, block storage access, IP-based app access) over user-level verbs/APIs with kernel bypass, upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems, UDAPL), mid-layer services (connection manager abstraction, SA client, MAD, OpenSM, diagnostic tools), kernel-level verbs/APIs, and hardware-specific drivers for InfiniBand HCAs and iWARP R-NICs]
Key: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; PMA = Performance Manager Agent; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (Initiator); iSER = iSCSI RDMA Protocol (Initiator); RDS = Reliable Datagram Service; UDAPL = User Direct Access Programming Lib; HCA = Host Channel Adapter; R-NIC = RDMA NIC
102
CCGrid '11 103
InfiniBand in the Top500
Percentage share of InfiniBand is steadily increasing
[Pie chart: number of Top500 systems by interconnect; Gigabit Ethernet (~45%) and InfiniBand (~43%) dominate, with proprietary, custom, Myrinet, Quadrics, NUMAlink, SP Switch, Cray, and other interconnects making up the remainder]
CCGrid '11 104
InfiniBand in the Top500 (Nov. 2010)
[Pie chart: aggregate Top500 performance share by interconnect (Nov. 2010); InfiniBand accounts for roughly 44%, ahead of Gigabit Ethernet (~25%), proprietary (~21%), and custom (~9%) interconnects]
105
InfiniBand System Efficiency in the Top500 List
CCGrid '11
[Scatter plot: compute-cluster efficiency (%) across the Top 500 systems, comparing IB-CPU, IB-GPU/Cell, GigE, 10GigE, IBM BlueGene, and Cray systems]
• 214 IB Clusters (42.8%) in the Nov ‘10 Top500 list (http://www.top500.org)
• Installations in the Top 30 (13 systems):
CCGrid '11
Large-scale InfiniBand Installations
120,640 cores (Nebulae) in China (3rd)
73,278 cores (Tsubame-2.0) in Japan (4th)
138,368 cores (Tera-100) in France (6th)
122,400 cores (RoadRunner) at LANL (7th)
81,920 cores (Pleiades) at NASA Ames (11th)
42,440 cores (Red Sky) at Sandia (14th)
62,976 cores (Ranger) at TACC (15th)
35,360 cores (Lomonosov) in Russia (17th)
15,120 cores (Loewe) in Germany (22nd)
26,304 cores (Juropa) in Germany (23rd)
26,232 cores (TachyonII) in South Korea (24th)
23,040 cores (Jade) at GENCI (27th)
33,120 cores (Mole-8.5) in China (28th)
More are getting installed!
106
• HSE compute systems with ranking in the Nov 2010 Top500 list
– 8,856-core installation in Purdue with ConnectX-EN 10GigE (#126)
– 7,944-core installation in Purdue with 10GigE Chelsio/iWARP (#147)
– 6,828-core installation in Germany (#166)
– 6,144-core installation in Germany (#214)
– 6,144-core installation in Germany (#215)
– 7,040-core installation at the Amazon EC2 Cluster (#231)
– 4,000-core installation at HHMI (#349)
• Other small clusters
– 640-core installation in University of Heidelberg, Germany
– 512-core installation at Sandia National Laboratory (SNL) with
Chelsio/iWARP and Woven Systems switch
– 256-core installation at Argonne National Lab with Myri-10G
• Integrated Systems
– BG/P uses 10GE for I/O (ranks 9, 13, 16, and 34 in the Top 50)
CCGrid '11 107
HSE Scientific Computing Installations
• HSE has most of its popularity in enterprise computing and
other non-scientific markets including Wide-area
networking
• Example Enterprise Computing Domains
– Enterprise Datacenters (HP, Intel)
– Animation firms (e.g., Universal Studios (“The Hulk”), 20th Century
Fox (“Avatar”), and many new movies using 10GE)
– Amazon’s HPC cloud offering uses 10GE internally
– Heavily used in financial markets (users are typically undisclosed)
• Many Network-attached Storage devices come integrated
with 10GE network adapters
• ESnet to install $62M 100GE infrastructure for US DOE
CCGrid '11 108
Other HSE Installations
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
CCGrid '11
Presentation Overview
109
Modern Interconnects and Protocols
110
[Figure: protocol options: sockets over the kernel TCP/IP stack on 1/10 GigE adapters and switches; sockets over hardware-offloaded TCP/IP (10GigE-TOE); sockets over IPoIB or SDP on InfiniBand adapters and switches; and verbs/RDMA in user space directly on InfiniBand or RDMA-capable 10 GigE adapters]
• Low-level Network Performance
• Clusters with Message Passing Interface (MPI)
• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP
(IPoIB)
• InfiniBand in WAN and Grid-FTP
• Cloud Computing: Hadoop and Memcached
CCGrid '11 111
Case Studies
CCGrid '11 112
Low-level Latency Measurements
[Graphs: small-message and large-message latency (µs) vs. message size for VPI-IB, Native IB, VPI-Eth, and RoCE]
ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
RoCE has a slight overhead compared to native IB because it operates at a slower clock rate (required to support
only a 10Gbps link for Ethernet, as compared to a 32Gbps link for IB)
CCGrid '11 113
Low-level Uni-directional Bandwidth Measurements
[Graph: uni-directional bandwidth (MBps) vs. message size for VPI-IB, Native IB, VPI-Eth, and RoCE]
ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
• Low-level Network Performance
• Clusters with Message Passing Interface (MPI)
• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP
(IPoIB)
• InfiniBand in WAN and Grid-FTP
• Cloud Computing: Hadoop and Memcached
CCGrid '11 114
Case Studies
• High Performance MPI Library for IB and HSE
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2)
– Used by more than 1,550 organizations in 60 countries
– More than 66,000 downloads from OSU site directly
– Empowering many TOP500 clusters
• 11th ranked 81,920-core cluster (Pleiades) at NASA
• 15th ranked 62,976-core cluster (Ranger) at TACC
– Available with software stacks of many IB, HSE and server vendors
including Open Fabrics Enterprise Distribution (OFED) and Linux
Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
CCGrid '11 115
MVAPICH/MVAPICH2 Software
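The MPI latency numbers on the following slides are typically gathered with a ping-pong microbenchmark such as the OSU micro-benchmarks; a minimal sketch of such a test between two ranks is shown below (the message size and iteration count are arbitrary illustrative choices).

```c
#include <mpi.h>
#include <stdio.h>

/* Ping-pong latency sketch: rank 0 and rank 1 bounce a small message back
 * and forth; half the average round-trip time is the one-way latency. */
int main(int argc, char **argv)
{
    const int iters = 10000, size = 4;       /* 4-byte "small" message */
    char buf[4] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```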
CCGrid '11 116
One-way Latency: MPI over IB
[Graphs: MPI small-message latency (1.54-2.17 µs) and large-message latency vs. message size for MVAPICH over Qlogic-DDR, Qlogic-QDR-PCIe2, ConnectX-DDR, and ConnectX-QDR-PCIe2]
All numbers taken on 2.4 GHz Quad-core (Nehalem) Intel with IB switch
CCGrid '11 117
Bandwidth: MPI over IB
[Graphs: MPI uni-directional bandwidth (peaks of 1553.2-3023.7 MillionBytes/sec) and bi-directional bandwidth (peaks of 2990.1-5835.7 MillionBytes/sec) vs. message size for the same four MVAPICH configurations]
All numbers taken on 2.4 GHz Quad-core (Nehalem) Intel with IB switch
CCGrid '11 118
One-way Latency: MPI over iWARP
[Graph: MPI one-way latency vs. message size for Chelsio and Intel-NetEffect adapters; iWARP achieves 6.25-6.88 µs versus 15.47-25.43 µs over host TCP/IP. Platform: 2.4 GHz quad-core Intel (Clovertown) with a 10GE (Fulcrum) switch]
CCGrid '11 119
Bandwidth: MPI over iWARP
[Graphs: MPI uni-directional bandwidth (peaks of 373.3-1245.0 MillionBytes/sec) and bi-directional bandwidth (peaks of 647.1-2260.8 MillionBytes/sec) for Chelsio and Intel-NetEffect adapters over TCP/IP and iWARP. Platform: 2.33 GHz quad-core Intel (Clovertown) with a 10GE (Fulcrum) switch]
CCGrid '11 120
Convergent Technologies: MPI Latency
[Graphs: MPI small-message and large-message latency for Native IB, VPI-IB, VPI-Eth, and RoCE; small-message latencies of 1.8, 2.2, and 3.6 µs for the verbs-based paths and 37.65 µs for VPI-Eth. Platform: ConnectX-DDR, 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches]
CCGrid '11 121
Convergent Technologies:
MPI Uni- and Bi-directional Bandwidth
[Graphs: MPI uni-directional bandwidth (peaks of 404.1-1518.1 MBps) and bi-directional bandwidth (peaks of 1085.5-2880.4 MBps) for Native IB, VPI-IB, VPI-Eth, and RoCE. Platform: ConnectX-DDR, 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches]
• Low-level Network Performance
• Clusters with Message Passing Interface (MPI)
• Datacenters with Sockets Direct Protocol (SDP) and
TCP/IP (IPoIB)
• InfiniBand in WAN and Grid-FTP
• Cloud Computing: Hadoop and Memcached
CCGrid '11 122
Case Studies
CCGrid '11 123
IPoIB vs. SDP Architectural Models
[Figure: traditional model: a sockets application uses the sockets API over the kernel TCP/IP sockets provider, TCP/IP transport driver, and IPoIB driver down to the InfiniBand CA. Possible SDP model: the same sockets API is mapped onto the Sockets Direct Protocol, which uses kernel bypass and RDMA semantics over the InfiniBand CA, with the TCP/IP path still available. (Source: InfiniBand Trade Association)]
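Both paths run unmodified sockets code. A plain TCP client like the sketch below can be pointed at an IPoIB interface address, or redirected over SDP by preloading the SDP library (commonly LD_PRELOAD=libsdp.so with OFED; the exact configuration varies and is an assumption here). The address and port below are placeholders.

```c
#include <stdio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Ordinary TCP client: no IB-specific code. Over IPoIB it travels through
 * the kernel TCP/IP stack; with SDP preloaded the same calls map onto RDMA. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(5000) };
    inet_pton(AF_INET, "192.168.10.1", &srv.sin_addr); /* placeholder IPoIB address */

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) == 0) {
        const char msg[] = "hello over IB";
        send(fd, msg, sizeof(msg), 0);
    }
    close(fd);
    return 0;
}
```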
CCGrid '11 124
SDP vs. IPoIB (IB QDR)
[Graphs: latency (µs), uni-directional bandwidth (MBps), and bi-directional bandwidth (MBps) vs. message size for IPoIB-RC, IPoIB-UD, and SDP over IB QDR]
SDP enables high bandwidth
(up to 15 Gbps),
low latency (6.6 µs)
• Low-level Network Performance
• Clusters with Message Passing Interface (MPI)
• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP
(IPoIB)
• InfiniBand in WAN and Grid-FTP
• Cloud Computing: Hadoop and Memcached
CCGrid '11 125
Case Studies
• Option 1: Layer-1 Optical networks
– IB standard specifies link, network and transport layers
– Can use any layer-1 (though the standard says copper and optical)
• Option 2: Link Layer Conversion Techniques
– InfiniBand-to-Ethernet conversion at the link layer: switches
available from multiple companies (e.g., Obsidian)
• Technically, it’s not conversion; it’s just tunneling (L2TP)
– InfiniBand’s network layer is IPv6 compliant
CCGrid '11 126
IB on the WAN
Features
• End-to-end guaranteed bandwidth channels
• Dynamic, in-advance, reservation and provisioning of fractional/full lambdas
• Secure control-plane for signaling
• Peering with ESnet, National Science Foundation CHEETAH, and other networks
CCGrid '11 127
UltraScience Net: Experimental Research Network Testbed
Since 2005
This and the following IB WAN slides are courtesy Dr. Nagi Rao (ORNL)
• Supports SONET OC-192 or
10GE LAN-PHY/WAN-PHY
• Idea is to make remote storage
“appear” local
• IB-WAN switch does frame
conversion
– IB standard allows per-hop credit-
based flow control
– IB-WAN switch uses large internal buffers to allow enough credits to
fill the wire (a back-of-the-envelope sizing sketch follows below)
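To see why such large buffers are needed: the credit buffering must cover roughly the bandwidth-delay product of the WAN segment. The short calculation below uses assumed numbers (8 Gbps IB 4x SDR, about 5 us/km propagation in fiber, an 8600-mile one-way path); none of these figures come from the slide itself.

```c
/* Back-of-the-envelope bandwidth-delay product for IB-WAN credit buffering.
 * Every input below is an illustrative assumption, not a figure from the slide. */
#include <stdio.h>

int main(void)
{
    double gbps    = 8.0;               /* IB 4x SDR data rate                  */
    double miles   = 8600.0;            /* assumed one-way path length          */
    double km      = miles * 1.609;
    double delay_s = km * 5e-6;         /* roughly 5 us per km in optical fiber */
    double bytes   = (gbps * 1e9 / 8.0) * delay_s;

    printf("one-way delay ~%.0f ms -> ~%.0f MB of buffering to keep the pipe full\n",
           delay_s * 1e3, bytes / 1e6);
    return 0;
}
```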
CCGrid '11 128
IB-WAN Connectivity with Obsidian Switches
[Diagram: a host (CPUs, system controller, system memory, HCA on the host bus) with attached storage connects to an IB switch; an IB-WAN switch bridges the IB fabric to the wide-area SONET/10GE link.]
CCGrid '11 129
InfiniBand Over SONET: Obsidian Longbows RDMA
throughput measurements over USN
[Diagram: Linux hosts at ORNL connected through Obsidian Longbow IB/S units to CDCI SONET (OC192) switches at ORNL, Chicago, Seattle and Sunnyvale; segment lengths of 700, 3300 and 4300 miles.]
– IB 4x link rate: 8 Gbps (full speed); host-to-host through a local switch: 7.5 Gbps
– ORNL loop (0.2 mile): 7.48 Gbps
– ORNL-Chicago loop (1400 miles): 7.47 Gbps
– ORNL-Chicago-Seattle loop (6600 miles): 7.37 Gbps
– ORNL-Chicago-Seattle-Sunnyvale loop (8600 miles): 7.34 Gbps
Hosts: dual-socket quad-core 2 GHz AMD Opteron, 4 GB memory, 8-lane PCI-Express slot, dual-port Voltaire 4x SDR HCA
CCGrid '11 130
IB over 10GE LAN-PHY and WAN-PHY
[Diagram: same USN testbed, with Chicago and Sunnyvale E300 units alongside the CDCI switches; links carried over OC192 and 10GE WAN PHY.]
– ORNL loop (0.2 mile): 7.5 Gbps
– ORNL-Chicago loop (1400 miles): 7.49 Gbps
– ORNL-Chicago-Seattle loop (6600 miles): 7.39 Gbps
– ORNL-Chicago-Seattle-Sunnyvale loop (8600 miles): 7.36 Gbps
MPI over IB-WAN: Obsidian Routers
Emulated delay vs. equivalent distance: 10 us ~ 2 km, 100 us ~ 20 km, 1000 us ~ 200 km, 10000 us ~ 2000 km
[Diagram: Cluster A and Cluster B connected over a WAN link through two Obsidian WAN routers with variable delay.]
[Charts: MPI bidirectional bandwidth, and impact of encryption on message rate (delay 0 ms).]
Hardware encryption has no impact on performance for a small number of communicating streams
131CCGrid '11
S. Narravula, H. Subramoni, P. Lai, R. Noronha and D. K. Panda, Performance of HPC Middleware over
InfiniBand WAN, Int'l Conference on Parallel Processing (ICPP '08), September 2008.
Communication Options in Grid
• Multiple options exist to perform data transfer on Grid
• Globus-XIO framework currently does not support IB natively
• We create the Globus-XIO ADTS driver and add native IB support
to GridFTP
[Diagram: High Performance Computing Applications and GridFTP sit on the Globus XIO framework, which can run over IB verbs, IPoIB, RoCE or TCP/IP across a 10 GigE network and Obsidian routers.]
132
CCGrid '11
Globus-XIO Framework with ADTS Driver
[Diagram: the Globus XIO interface dispatches to XIO drivers #1..#n. The user-level Globus-XIO ADTS driver adds data connection management, persistent session management, buffer and file management, and a data transport interface with zero-copy channels, memory registration and flow control, targeting InfiniBand/RoCE and 10GigE/iWARP over modern WAN interconnects, with access to the file system.]
133CCGrid '11
134
Performance of Memory Based
Data Transfer
• Performance numbers obtained while transferring 128 GB of
aggregate data in chunks of 256 MB files
• ADTS based implementation is able to saturate the link
bandwidth
• Best performance for ADTS obtained when performing data
transfer with a network buffer of size 32 MB
[Charts: bandwidth (MBps) vs. network delay (0, 10, 100, 1000 us) for the ADTS driver and the UDT driver with network buffer sizes of 2, 8, 32 and 64 MB.]
CCGrid '11
135
Performance of Disk Based Data Transfer
• Performance numbers obtained while transferring 128 GB of
aggregate data in chunks of 256 MB files
• Predictable as well as better performance when disk-I/O threads
assist the network thread (Asynchronous Mode; see the producer-consumer sketch below)
• Best performance for ADTS obtained with a circular buffer
with individual buffers of size 64 MB
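The sketch below illustrates the asynchronous pattern described above: a disk-I/O thread keeps filling a circular ring of buffers while a network thread drains it, so the sender rarely stalls on the disk. This is an illustration of the general producer-consumer idea, not the ADTS code; the buffer count, buffer size and the two helper stubs are made up.

```c
/* Producer-consumer sketch of overlapping disk reads with network sends.
 * Illustration of the general pattern only, not the ADTS implementation;
 * buffer count/size and the two helper stubs are made up.  Build: gcc -pthread */
#include <pthread.h>
#include <semaphore.h>
#include <string.h>

#define NBUF  4
#define BUFSZ (1 << 20)   /* 1 MB here; the best ADTS configuration used 64 MB buffers */

static char   bufs[NBUF][BUFSZ];
static size_t lens[NBUF];
static sem_t  empty_slots, full_slots;     /* counts of empty / filled buffers */

static size_t read_chunk_from_file(char *dst, size_t max)   /* stub "disk" */
{
    static int chunks_left = 8;
    if (chunks_left-- <= 0) return 0;                        /* end of file */
    memset(dst, 'x', max);
    return max;
}

static void send_chunk_over_network(const char *src, size_t len)  /* stub "network" */
{
    (void)src; (void)len;   /* a real driver would post this to a socket or verbs QP */
}

static void *disk_thread(void *arg)        /* producer: file -> ring of buffers */
{
    (void)arg;
    for (int i = 0; ; i = (i + 1) % NBUF) {
        sem_wait(&empty_slots);            /* wait for a free buffer */
        lens[i] = read_chunk_from_file(bufs[i], BUFSZ);
        sem_post(&full_slots);
        if (lens[i] == 0) break;
    }
    return NULL;
}

static void *net_thread(void *arg)         /* consumer: ring of buffers -> network */
{
    (void)arg;
    for (int i = 0; ; i = (i + 1) % NBUF) {
        sem_wait(&full_slots);             /* wait for a filled buffer */
        if (lens[i] == 0) break;
        send_chunk_over_network(bufs[i], lens[i]);
        sem_post(&empty_slots);
    }
    return NULL;
}

int main(void)
{
    pthread_t d, n;
    sem_init(&empty_slots, 0, NBUF);
    sem_init(&full_slots, 0, 0);
    pthread_create(&d, NULL, disk_thread, NULL);
    pthread_create(&n, NULL, net_thread, NULL);
    pthread_join(d, NULL);
    pthread_join(n, NULL);
    return 0;
}
```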
[Charts: bandwidth (MBps) vs. network delay (0, 10, 100, 1000 us) in synchronous and asynchronous modes, for ADTS with 8, 16, 32 and 64 MB buffers and IPoIB with 64 MB buffers.]
CCGrid '11
136
Application Level Performance
[Chart: bandwidth (MBps) of ADTS vs. IPoIB for the CCSM and Ultra-Viz target applications.]
• Application performance for FTP get
operation for disk based transfers
• Community Climate System Model
(CCSM)
• Part of Earth System Grid Project
• Transfers 160 TB of total data in
chunks of 256 MB
• Network latency - 30 ms
• Ultra-Scale Visualization (Ultra-Viz)
• Transfers files of size 2.6 GB
• Network latency - 80 ms
The ADTS driver outperforms the UDT driver running over IPoIB by more than 100%
H. Subramoni, P. Lai, R. Kettimuthu and D. K. Panda, High Performance Data Transfer in
Grid Environment Using GridFTP over InfiniBand, Int'l Symposium on Cluster Computing and
the Grid (CCGrid), May 2010.
136CCGrid '11
• Low-level Network Performance
• Clusters with Message Passing Interface (MPI)
• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP
(IPoIB)
• InfiniBand in WAN and Grid-FTP
• Cloud Computing: Hadoop and Memcached
CCGrid '11 137
Case Studies
A New Approach towards OFA in Cloud
[Diagram: three software stacks. Current cloud software design: Application -> Sockets -> 1/10 GigE network. Current approach towards OFA in cloud: Application -> Accelerated Sockets -> 10 GigE or InfiniBand with verbs / hardware offload. Our approach towards OFA in cloud: Application -> Verbs Interface -> 10 GigE or InfiniBand.]
• Sockets not designed for high-performance
– Stream semantics often mismatch for upper layers (Memcached, Hadoop)
– Zero-copy not available for non-blocking sockets (Memcached)
• Significant consolidation in cloud system software
– Hadoop and Memcached are developer facing APIs, not sockets
– Improving Hadoop and Memcached will benefit many applications immediately!
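To make the "Verbs Interface" path above concrete, the sketch below shows the first steps any verbs-level design must take: open an IB device, allocate a protection domain and register memory with libibverbs. It is a minimal illustration, not the OSU Memcached/Hadoop design itself.

```c
/* Minimal libibverbs setup sketch: open a device, allocate a protection domain,
 * and register a buffer for RDMA.  Illustrative only; error handling trimmed. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    /* Next steps (omitted): create a completion queue and queue pair, exchange
     * the rkey/address with the peer, then post RDMA read/write or send/recv. */
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```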
CCGrid '11 138
Memcached Design Using Verbs
• Server and client perform a negotiation protocol
– Master thread assigns clients to appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly
and is “bound” to that thread
• All other Memcached data structures are shared among RDMA and Sockets
worker threads
[Diagram: sockets clients and RDMA clients first contact the master thread (step 1), which assigns each client to a sockets worker thread or a verbs worker thread; the client then communicates directly with its assigned worker (step 2). All worker threads share the Memcached data structures (memory slabs, items).]
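From the client side, the workload driven in the following experiments is ordinary Memcached set/get traffic. A minimal C client using the stock, sockets-based libmemcached library is sketched below; the server address, key and value are placeholders, and this is not the verbs-based design itself.

```c
/* Minimal libmemcached client sketch (stock sockets-based client library, not
 * the verbs design; server address, key and value are placeholders).
 * Build with: gcc client.c -lmemcached */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);
    memcached_server_st *servers =
        memcached_server_list_append(NULL, "127.0.0.1", 11211, &rc);
    memcached_server_push(memc, servers);

    const char *key = "counter", *val = "42";
    rc = memcached_set(memc, key, strlen(key), val, strlen(val),
                       (time_t)0, (uint32_t)0);

    size_t len = 0;
    uint32_t flags = 0;
    char *out = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (out) {
        printf("get(%s) = %.*s\n", key, (int)len, out);
        free(out);
    }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}
```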
CCGrid '11 139
Memcached Get Latency
• Memcached Get latency
– 4 bytes – DDR: 6 us; QDR: 5 us
– 4K bytes -- DDR: 20 us; QDR:12 us
• Almost factor of four improvement over 10GE (TOE) for 4K
• We are in the process of evaluating iWARP on 10GE
[Charts: Memcached get time (us) vs. message size (1 B-4 KB) on the Intel Clovertown cluster (IB DDR) and the Intel Westmere cluster (IB QDR), comparing SDP, IPoIB, the OSU design, 1G and 10G-TOE.]
CCGrid '11 140
Memcached Get TPS
• Memcached Get transactions per second for 4 bytes
– On IB DDR about 600K/s for 16 clients
– On IB QDR 1.9M/s for 16 clients
• Almost factor of six improvement over 10GE (TOE)
• We are in the process of evaluating iWARP on 10GE
[Charts: thousands of Memcached get transactions per second (TPS) for 8 and 16 clients on the Intel Clovertown cluster (IB DDR) and the Intel Westmere cluster (IB QDR), comparing SDP, IPoIB, 10G-TOE and the OSU design.]
CCGrid '11 141
Hadoop: Java Communication Benchmark
• Sockets-level ping-pong bandwidth test (a compact C sketch follows below)
• Java performance depends on usage of NIO (allocateDirect)
• C and Java versions of the benchmark have similar performance
• HDFS does not use direct allocated blocks or NIO on DataNode
S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda, “Can High-Performance
Interconnects Benefit Hadoop Distributed File System?”, MASVDC '10 in conjunction with
MICRO 2010, Atlanta, GA.
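A compact C sketch of such a sockets-level bandwidth test is shown below. For simplicity it forks a receiver and streams over loopback; the real benchmark ran between two hosts (and has a matching Java/NIO version), so treat this purely as the shape of the measurement loop. Port, message size and iteration count are illustrative.

```c
/* Compact sockets streaming-bandwidth sketch over loopback (illustrative only). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define MSG   (1 << 20)   /* 1 MB messages */
#define ITERS 256         /* 256 MB total  */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    struct sockaddr_in addr;
    char *buf = calloc(1, MSG);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(15000);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (fork() == 0) {                              /* child: receiver */
        int ls = socket(AF_INET, SOCK_STREAM, 0), one = 1, c;
        long left = (long)MSG * ITERS;
        ssize_t got;
        setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        bind(ls, (struct sockaddr *)&addr, sizeof(addr));
        listen(ls, 1);
        c = accept(ls, NULL, NULL);
        while (left > 0 && (got = read(c, buf, MSG)) > 0)
            left -= got;
        write(c, "k", 1);                           /* ack full delivery */
        return 0;
    }

    sleep(1);                                       /* crude wait for the receiver */
    int s = socket(AF_INET, SOCK_STREAM, 0);
    connect(s, (struct sockaddr *)&addr, sizeof(addr));

    double t0 = now();
    for (int i = 0; i < ITERS; i++) {
        long off = 0;
        while (off < MSG) {
            ssize_t n = write(s, buf + off, MSG - off);
            if (n <= 0) { perror("write"); return 1; }
            off += n;
        }
    }
    read(s, buf, 1);                                /* wait for receiver's ack */
    double t1 = now();

    printf("bandwidth: %.1f MB/s\n", (double)MSG * ITERS / (t1 - t0) / 1e6);
    wait(NULL);
    return 0;
}
```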
[Charts: bandwidth (MB/s) vs. message size (1 B-4 MB) for the C and Java versions of the benchmark, comparing 1GE, IPoIB, SDP, 10GE-TOE and SDP-NIO.]
CCGrid '11 142
Hadoop: DFS IO Write Performance
• DFS IO included in Hadoop, measures sequential access throughput
• We have two map tasks each writing to a file of increasing size (1-10GB)
• Significant improvement with IPoIB, SDP and 10GigE
• With SSD, performance improvement is almost seven or eight fold!
• SSD benefits not seen without using high-performance interconnect!
• In-line with comment on Google keynote about I/O performance
[Charts: average write throughput (MB/sec) vs. file size (1-10 GB) on four DataNodes, using HDD and SSD, for 1GE, IPoIB, SDP and 10GE-TOE.]
CCGrid '11 143
Hadoop: RandomWriter Performance
• Each map generates 1GB of random binary data and writes to HDFS
• SSD improves execution time by 50% with 1GigE for two DataNodes
• For four DataNodes, benefits are observed only with HPC interconnect
• IPoIB, SDP and 10GigE can improve performance by 59% on four DataNodes
[Chart: execution time (sec) for two and four DataNodes, for 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]
CCGrid '11 144
Hadoop Sort Benchmark
• Sort: baseline benchmark for Hadoop
• Sort phase: I/O bound; Reduce phase: communication bound
• SSD improves performance by 28% using 1GigE with two DataNodes
• Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE
[Chart: execution time (sec) for two and four DataNodes, for 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]
CCGrid '11 145
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
CCGrid '11
Presentation Overview
146
• Presented network architectures & trends for Clusters,
Grid, Multi-tier Datacenters and Cloud Computing Systems
• Presented background and details of IB and HSE
– Highlighted the main features of IB and HSE and their convergence
– Gave an overview of IB and HSE hardware/software products
– Discussed sample performance numbers in designing various high-
end systems with IB and HSE
• IB and HSE are emerging as new architectures leading to a
new generation of networked computing systems, opening
many research issues needing novel solutions
CCGrid '11
Concluding Remarks
147
CCGrid '11
Funding Acknowledgments
Funding Support by
Equipment Support by
148
CCGrid '11
Personnel Acknowledgments
Current Students
– N. Dandapanthula (M.S.)
– R. Darbha (M.S.)
– V. Dhanraj (M.S.)
– J. Huang (Ph.D.)
– J. Jose (Ph.D.)
– K. Kandalla (Ph.D.)
– M. Luo (Ph.D.)
– V. Meshram (M.S.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
– A. Singh (Ph.D.)
– H. Subramoni (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– P. Lai (Ph. D.)
– J. Liu (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– G. Santhanaraman (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
149
Current Research Scientist
– S. Sur
Current Post-Docs
– H. Wang
– J. Vienne
– X. Besseron
Current Programmers
– M. Arnold
– D. Bureddy
– J. Perkins
Past Post-Docs
– E. Mancini
– S. Marcarelli
– H.-W. Jin
CCGrid '11
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~surs
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu
panda@cse.ohio-state.edu
surs@cse.ohio-state.edu
150
Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Ethernet

  • 1. Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Ethernet Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda A Tutorial at CCGrid ’11 by Sayantan Sur The Ohio State University E-mail: surs@cse.ohio-state.edu http://www.cse.ohio-state.edu/~surs
  • 2. • Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installations • Sample Case Studies and Performance Numbers • Conclusions and Final Q&A CCGrid '11 Presentation Overview 2
  • 3. CCGrid '11 Current and Next Generation Applications and Computing Systems 3 • Diverse Range of Applications – Processing and dataset characteristics vary • Growth of High Performance Computing – Growth in processor performance • Chip density doubles every 18 months – Growth in commodity networking • Increase in speed/features + reducing cost • Different Kinds of Systems – Clusters, Grid, Cloud, Datacenters, …..
  • 4. CCGrid '11 Cluster Computing Environment Compute cluster LANFrontend Meta-Data Manager I/O Server Node Meta Data Data Compute Node Compute Node I/O Server Node Data Compute Node I/O Server Node Data Compute Node L A N Storage cluster LAN 4
  • 5. CCGrid '11 Trends for Computing Clusters in the Top 500 List (http://www.top500.org) Nov. 1996: 0/500 (0%) Nov. 2001: 43/500 (8.6%) Nov. 2006: 361/500 (72.2%) Jun. 1997: 1/500 (0.2%) Jun. 2002: 80/500 (16%) Jun. 2007: 373/500 (74.6%) Nov. 1997: 1/500 (0.2%) Nov. 2002: 93/500 (18.6%) Nov. 2007: 406/500 (81.2%) Jun. 1998: 1/500 (0.2%) Jun. 2003: 149/500 (29.8%) Jun. 2008: 400/500 (80.0%) Nov. 1998: 2/500 (0.4%) Nov. 2003: 208/500 (41.6%) Nov. 2008: 410/500 (82.0%) Jun. 1999: 6/500 (1.2%) Jun. 2004: 291/500 (58.2%) Jun. 2009: 410/500 (82.0%) Nov. 1999: 7/500 (1.4%) Nov. 2004: 294/500 (58.8%) Nov. 2009: 417/500 (83.4%) Jun. 2000: 11/500 (2.2%) Jun. 2005: 304/500 (60.8%) Jun. 2010: 424/500 (84.8%) Nov. 2000: 28/500 (5.6%) Nov. 2005: 360/500 (72.0%) Nov. 2010: 415/500 (83%) Jun. 2001: 33/500 (6.6%) Jun. 2006: 364/500 (72.8%) Jun. 2011: To be announced 5
  • 6. CCGrid '11 Grid Computing Environment 6 Compute cluster LAN Frontend Meta-Data Manager I/O Server Node Meta Data Data Compute Node Compute Node I/O Server Node Data Compute Node I/O Server Node Data Compute Node L A N Storage cluster LAN Compute cluster LAN Frontend Meta-Data Manager I/O Server Node Meta Data Data Compute Node Compute Node I/O Server Node Data Compute Node I/O Server Node Data Compute Node L A N Storage cluster LAN Compute cluster LAN Frontend Meta-Data Manager I/O Server Node Meta Data Data Compute Node Compute Node I/O Server Node Data Compute Node I/O Server Node Data Compute Node L A N Storage cluster LAN WAN
  • 7. CCGrid '11 Multi-Tier Datacenters and Enterprise Computing 7 . . . . Enterprise Multi-tier Datacenter Tier1 Tier3 Routers/ Servers Switch Database Server Application Server Routers/ Servers Routers/ Servers Application Server Application Server Application Server Database Server Database Server Database Server Switch Switch Routers/ Servers Tier2
  • 8. CCGrid '11 Integrated High-End Computing Environments Compute cluster Meta-Data Manager I/O Server Node Meta Data Data Compute Node Compute Node I/O Server Node Data Compute Node I/O Server Node Data Compute Node L A NLANFrontend Storage cluster LAN/WAN . . . . Enterprise Multi-tier Datacenter for Visualization and Mining Tier1 Tier3 Routers/ Servers Switch Database Server Application Server Routers/ Servers Routers/ Servers Application Server Application Server Application Server Database Server Database Server Database Server Switch Switch Routers/ Servers Tier2 8
  • 9. CCGrid '11 Cloud Computing Environments 9 LAN Physical Machine VM VM Physical Machine VM VM Physical Machine VM VM VirtualFS Meta-Data Meta Data I/O Server Data I/O Server Data I/O Server Data I/O Server Data Physical Machine VM VM Local Storage Local Storage Local Storage Local Storage
  • 10. Hadoop Architecture • Underlying Hadoop Distributed File System (HDFS) • Fault-tolerance by replicating data blocks • NameNode: stores information on data blocks • DataNodes: store blocks and host Map-reduce computation • JobTracker: track jobs and detect failure • Model scales but high amount of communication during intermediate phases 10CCGrid '11
  • 11. Memcached Architecture • Distributed Caching Layer – Allows to aggregate spare memory from multiple nodes – General purpose • Typically used to cache database queries, results of API calls • Scalable model, but typical usage very network intensive 11 Internet Web-frontend Servers Memcached Servers Database Servers System Area Network System Area Network CCGrid '11
  • 12. • Good System Area Networks with excellent performance (low latency, high bandwidth and low CPU utilization) for inter-processor communication (IPC) and I/O • Good Storage Area Networks high performance I/O • Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity • Quality of Service (QoS) for interactive applications • RAS (Reliability, Availability, and Serviceability) • With low cost CCGrid '11 Networking and I/O Requirements 12
  • 13. • Hardware components – Processing cores and memory subsystem – I/O bus or links – Network adapters/switches • Software components – Communication stack • Bottlenecks can artificially limit the network performance the user perceives CCGrid '11 13 Major Components in Computing Systems P0 Core0 Core1 Core2 Core3 P1 Core0 Core1 Core2 Core3 Memory MemoryI / O B u s Network Adapter Network Switch Processing Bottlenecks I/O Interface Bottlenecks Network Bottlenecks
  • 14. • Ex: TCP/IP, UDP/IP • Generic architecture for all networks • Host processor handles almost all aspects of communication – Data buffering (copies on sender and receiver) – Data integrity (checksum) – Routing aspects (IP routing) • Signaling between different layers – Hardware interrupt on packet arrival or transmission – Software signals between different layers to handle protocol processing in different priority levels CCGrid '11 Processing Bottlenecks in Traditional Protocols 14 P0 Core0 Core1 Core2 Core3 P1 Core0 Core1 Core2 Core3 Memory MemoryI / O B u s Network Adapter Network Switch Processing Bottlenecks
  • 15. • Traditionally relied on bus-based technologies (last mile bottleneck) – E.g., PCI, PCI-X – One bit per wire – Performance increase through: • Increasing clock speed • Increasing bus width – Not scalable: • Cross talk between bits • Skew between wires • Signal integrity makes it difficult to increase bus width significantly, especially for high clock speeds CCGrid '11 Bottlenecks in Traditional I/O Interfaces and Networks 15 PCI 1990 33MHz/32bit: 1.05Gbps (shared bidirectional) PCI-X 1998 (v1.0) 2003 (v2.0) 133MHz/64bit: 8.5Gbps (shared bidirectional) 266-533MHz/64bit: 17Gbps (shared bidirectional) P0 Core0 Core1 Core2 Core3 P1 Core0 Core1 Core2 Core3 Memory MemoryI / O B u s Network Adapter Network Switch I/O Interface Bottlenecks
  • 16. • Network speeds saturated at around 1Gbps – Features provided were limited – Commodity networks were not considered scalable enough for very large-scale systems CCGrid '11 16 Bottlenecks on Traditional Networks Ethernet (1979 - ) 10 Mbit/sec Fast Ethernet (1993 -) 100 Mbit/sec Gigabit Ethernet (1995 -) 1000 Mbit /sec ATM (1995 -) 155/622/1024 Mbit/sec Myrinet (1993 -) 1 Gbit/sec Fibre Channel (1994 -) 1 Gbit/sec P0 Core0 Core1 Core2 Core3 P1 Core0 Core1 Core2 Core3 Memory MemoryI / O B u s Network Adapter Network Switch Network Bottlenecks
  • 17. • Industry Networking Standards • InfiniBand and High-speed Ethernet were introduced into the market to address these bottlenecks • InfiniBand aimed at all three bottlenecks (protocol processing, I/O bus, and network speed) • Ethernet aimed at directly handling the network speed bottleneck and relying on complementary technologies to alleviate the protocol processing and I/O bus bottlenecks CCGrid '11 17 Motivation for InfiniBand and High-speed Ethernet
  • 18. • Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installations • Sample Case Studies and Performance Numbers • Conclusions and Final Q&A CCGrid '11 Presentation Overview 18
  • 19. • IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun) • Goal: To design a scalable and high performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies • Many other industry participated in the effort to define the IB architecture specification • IB Architecture (Volume 1, Version 1.0) was released to public on Oct 24, 2000 – Latest version 1.2.1 released January 2008 • http://www.infinibandta.org CCGrid '11 IB Trade Association 19
  • 20. • 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step • Goal: To achieve a scalable and high performance communication architecture while maintaining backward compatibility with Ethernet • http://www.ethernetalliance.org • 40-Gbps (Servers) and 100-Gbps Ethernet (Backbones, Switches, Routers): IEEE 802.3 WG • Energy-efficient and power-conscious protocols – On-the-fly link speed reduction for under-utilized links CCGrid '11 High-speed Ethernet Consortium (10GE/40GE/100GE) 20
  • 21. • Network speed bottlenecks • Protocol processing bottlenecks • I/O interface bottlenecks CCGrid '11 21 Tackling Communication Bottlenecks with IB and HSE
  • 22. • Bit serial differential signaling – Independent pairs of wires to transmit independent data (called a lane) – Scalable to any number of lanes – Easy to increase clock speed of lanes (since each lane consists only of a pair of wires) • Theoretically, no perceived limit on the bandwidth CCGrid '11 Network Bottleneck Alleviation: InfiniBand (“Infinite Bandwidth”) and High-speed Ethernet (10/40/100 GE) 22
  • 23. CCGrid '11 Network Speed Acceleration with IB and HSE Ethernet (1979 - ) 10 Mbit/sec Fast Ethernet (1993 -) 100 Mbit/sec Gigabit Ethernet (1995 -) 1000 Mbit /sec ATM (1995 -) 155/622/1024 Mbit/sec Myrinet (1993 -) 1 Gbit/sec Fibre Channel (1994 -) 1 Gbit/sec InfiniBand (2001 -) 2 Gbit/sec (1X SDR) 10-Gigabit Ethernet (2001 -) 10 Gbit/sec InfiniBand (2003 -) 8 Gbit/sec (4X SDR) InfiniBand (2005 -) 16 Gbit/sec (4X DDR) 24 Gbit/sec (12X SDR) InfiniBand (2007 -) 32 Gbit/sec (4X QDR) 40-Gigabit Ethernet (2010 -) 40 Gbit/sec InfiniBand (2011 -) 56 Gbit/sec (4X FDR) InfiniBand (2012 -) 100 Gbit/sec (4X EDR) 20 times in the last 9 years 23
  • 24. 2005 - 2006 - 2007 - 2008 - 2009 - 2010 - 2011 Bandwidthperdirection(Gbps) 32G-IB-DDR 48G-IB-DDR 96G-IB-QDR 48G-IB-QDR 200G-IB-EDR 112G-IB-FDR 300G-IB-EDR 168G-IB-FDR 8x HDR 12x HDR # of Lanes per direction Per Lane & Rounded Per Link Bandwidth (Gb/s) 4G-IB DDR 8G-IB QDR 14G-IB-FDR (14.025) 26G-IB-EDR (25.78125) 12 48+48 96+96 168+168 300+300 8 32+32 64+64 112+112 200+200 4 16+16 32+32 56+56 100+100 1 4+4 8+8 14+14 25+25 NDR = Next Data Rate HDR = High Data Rate EDR = Enhanced Data Rate FDR = Fourteen Data Rate QDR = Quad Data Rate DDR = Double Data Rate SDR = Single Data Rate (not shown) x12 x8 x1 2015 12x NDR 8x NDR 32G-IB-QDR 100G-IB-EDR 56G-IB-FDR 16G-IB-DDR 4x HDR x4 4x NDR 8G-IB-QDR 25G-IB-EDR 14G-IB-FDR 1x HDR 1x NDR InfiniBand Link Speed Standardization Roadmap 24CCGrid '11
  • 25. • Network speed bottlenecks • Protocol processing bottlenecks • I/O interface bottlenecks CCGrid '11 25 Tackling Communication Bottlenecks with IB and HSE
  • 26. • Intelligent Network Interface Cards • Support entire protocol processing completely in hardware (hardware protocol offload engines) • Provide a rich communication interface to applications – User-level communication capability – Gets rid of intermediate data buffering requirements • No software signaling between communication layers – All layers are implemented on a dedicated hardware unit, and not on a shared host CPU CCGrid '11 Capabilities of High-Performance Networks 26
  • 27. • Fast Messages (FM) – Developed by UIUC • Myricom GM – Proprietary protocol stack from Myricom • These network stacks set the trend for high-performance communication requirements – Hardware offloaded protocol stack – Support for fast and secure user-level access to the protocol stack • Virtual Interface Architecture (VIA) – Standardized by Intel, Compaq, Microsoft – Precursor to IB CCGrid '11 Previous High-Performance Network Stacks 27
  • 28. • Some IB models have multiple hardware accelerators – E.g., Mellanox IB adapters • Protocol Offload Engines – Completely implement ISO/OSI layers 2-4 (link layer, network layer and transport layer) in hardware • Additional hardware supported features also present – RDMA, Multicast, QoS, Fault Tolerance, and many more CCGrid '11 IB Hardware Acceleration 28
  • 29. • Interrupt Coalescing – Improves throughput, but degrades latency • Jumbo Frames – No latency impact; Incompatible with existing switches • Hardware Checksum Engines – Checksum performed in hardware → significantly faster – Shown to have minimal benefit independently • Segmentation Offload Engines (a.k.a. Virtual MTU) – Host processor “thinks” that the adapter supports large Jumbo frames, but the adapter splits it into regular-sized (1500-byte) frames – Supported by most HSE products because of its backward compatibility → considered “regular” Ethernet – Heavily used in the “server-on-steroids” model • High performance servers connected to regular clients CCGrid '11 Ethernet Hardware Acceleration 29
  • 30. • TCP Offload Engines (TOE) – Hardware Acceleration for the entire TCP/IP stack – Initially patented by Tehuti Networks – Actually refers to the IC on the network adapter that implements TCP/IP – In practice, usually referred to as the entire network adapter • Internet Wide-Area RDMA Protocol (iWARP) – Standardized by IETF and the RDMA Consortium – Support acceleration features (like IB) for Ethernet • http://www.ietf.org & http://www.rdmaconsortium.org CCGrid '11 TOE and iWARP Accelerators 30
  • 31. • Also known as “Datacenter Ethernet” or “Lossless Ethernet” – Combines a number of optional Ethernet standards into one umbrella as mandatory requirements • Sample enhancements include: – Priority-based flow-control: Link-level flow control for each Class of Service (CoS) – Enhanced Transmission Selection (ETS): Bandwidth assignment to each CoS – Data Center Bridging Exchange protocol (DCBX): Congestion notification, Priority classes – End-to-end Congestion notification: Per flow congestion control to supplement per link flow control CCGrid '11 Converged (Enhanced) Ethernet (CEE or CE) 31
  • 32. • Network speed bottlenecks • Protocol processing bottlenecks • I/O interface bottlenecks CCGrid '11 32 Tackling Communication Bottlenecks with IB and HSE
  • 33. • InfiniBand initially intended to replace I/O bus technologies with networking-like technology – That is, bit serial differential signaling – With enhancements in I/O technologies that use a similar architecture (HyperTransport, PCI Express), this has become mostly irrelevant now • Both IB and HSE today come as network adapters that plug into existing I/O technologies CCGrid '11 Interplay with I/O Technologies 33
  • 34. • Recent trends in I/O interfaces show that they are nearly matching head-to-head with network speeds (though they still lag a little bit) CCGrid '11 Trends in I/O Interfaces with Servers
  – PCI (1990): 33MHz/32bit: 1.05Gbps (shared bidirectional)
  – PCI-X (1998 v1.0, 2003 v2.0): 133MHz/64bit: 8.5Gbps (shared bidirectional); 266-533MHz/64bit: 17Gbps (shared bidirectional)
  – AMD HyperTransport (HT) (2001 v1.0, 2004 v2.0, 2006 v3.0, 2008 v3.1): 102.4Gbps (v1.0), 179.2Gbps (v2.0), 332.8Gbps (v3.0), 409.6Gbps (v3.1) (32 lanes)
  – PCI-Express (PCIe) by Intel (2003 Gen1, 2007 Gen2, 2009 Gen3 standard): Gen1: 4X (8Gbps), 8X (16Gbps), 16X (32Gbps); Gen2: 4X (16Gbps), 8X (32Gbps), 16X (64Gbps); Gen3: 4X (~32Gbps), 8X (~64Gbps), 16X (~128Gbps)
  – Intel QuickPath Interconnect (QPI) (2009): 153.6-204.8Gbps (20 lanes) 34
  • 35. • Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installations • Sample Case Studies and Performance Numbers • Conclusions and Final Q&A CCGrid '11 Presentation Overview 35
  • 36. • InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services • High-speed Ethernet Family – Internet Wide Area RDMA Protocol (iWARP) – Alternate vendor-specific protocol stacks • InfiniBand/Ethernet Convergence Technologies – Virtual Protocol Interconnect (VPI) – (InfiniBand) RDMA over Ethernet (RoE) – (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE) CCGrid '11 IB, HSE and their Convergence 36
  • 37. CCGrid '11 37 Comparing InfiniBand with Traditional Networking Stack
  – Application Layer: MPI, PGAS, File Systems (InfiniBand) vs. HTTP, FTP, MPI, File Systems over the Sockets Interface (Traditional Ethernet)
  – Transport Layer: OpenFabrics Verbs, RC (reliable), UD (unreliable) vs. TCP, UDP
  – Network Layer: Routing, with OpenSM as the management tool vs. Routing, with DNS and related management tools
  – Link Layer: Flow-control, Error Detection vs. Flow-control and Error Detection
  – Physical Layer: Copper or Optical vs. Copper, Optical or Wireless
  • 38. • InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection • Channel and memory semantics – Novel Features • Hardware Protocol Offload – Link, network and transport layer features – Subnet Management and Services CCGrid '11 IB Overview 38
  • 39. • Used by processing and I/O units to connect to fabric • Consume & generate IB packets • Programmable DMA engines with protection features • May have multiple ports – Independent buffering channeled through Virtual Lanes • Host Channel Adapters (HCAs) CCGrid '11 Components: Channel Adapters 39 [Diagram: a channel adapter with QPs, DMA engine, memory, transport unit (MTP), SMA, and multiple ports, each with its own Virtual Lanes]
  • 40. • Relay packets from a link to another • Switches: intra-subnet • Routers: inter-subnet • May support multicast CCGrid '11 Components: Switches and Routers 40 [Diagrams: a switch (packet relay across ports with Virtual Lanes) and a router (packet relay based on the GRH, across ports with Virtual Lanes)]
  • 41. • Network Links – Copper, Optical, Printed Circuit wiring on Back Plane – Not directly addressable • Traditional adapters built for copper cabling – Restricted by cable length (signal integrity) – For example, QDR copper cables are restricted to 7m • Intel Connects: Optical cables with Copper-to-optical conversion hubs (acquired by Emcore) – Up to 100m length – 550 picoseconds copper-to-optical conversion latency • Available from other vendors (Luxtera) • Repeaters (Vol. 2 of InfiniBand specification) CCGrid '11 Components: Links & Repeaters (Courtesy Intel) 41
  • 42. • InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection • Channel and memory semantics – Novel Features • Hardware Protocol Offload – Link, network and transport layer features – Subnet Management and Services CCGrid '11 IB Overview 42
  • 43. CCGrid '11 IB Communication Model Basic InfiniBand Communication Semantics 43
  • 44. • Each QP has two queues – Send Queue (SQ) – Receive Queue (RQ) – Work requests are queued to the QP (WQEs: “Wookies”) • QP to be linked to a Completion Queue (CQ) – Gives notification of operation completion from QPs – Completed WQEs are placed in the CQ with additional information (CQEs: “Cookies”) CCGrid '11 Queue Pair Model 44 [Diagram: an InfiniBand device with a QP (Send and Recv queues holding WQEs) and a CQ holding CQEs]
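As a concrete illustration of the queue pair model, below is a minimal libibverbs sketch (not part of the tutorial) that creates a CQ and a Reliable Connection QP. Device open, protection-domain allocation, the QP state transitions needed before communication, and error paths are assumed to be handled elsewhere.

```c
/* Sketch only: CQ + RC QP creation with libibverbs; 'ctx' and 'pd' are
 * assumed to come from ibv_open_device()/ibv_alloc_pd(). The new QP still
 * has to be moved through INIT/RTR/RTS with ibv_modify_qp() before use. */
#include <infiniband/verbs.h>

struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                            struct ibv_cq **cq_out)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 128 /* CQE depth */, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,                /* CQEs for completed send WQEs */
        .recv_cq = cq,                /* CQEs for completed recv WQEs */
        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,        /* IBV_QPT_UD for unreliable datagrams */
    };

    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) {
        ibv_destroy_cq(cq);
        return NULL;
    }
    *cq_out = cq;
    return qp;
}
```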
  • 45. 1. Registration Request • Send virtual address and length 2. Kernel handles virtual->physical mapping and pins region into physical memory • Process cannot map memory that it does not own (security!) 3. HCA caches the virtual to physical mapping and issues a handle • Includes an l_key and r_key 4. Handle is returned to application CCGrid '11 Memory Registration Before we do any communication: All memory used for communication must be registered [Diagram: steps 1-4 flowing between the Process, the Kernel and the HCA/RNIC] 45
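A minimal sketch of this registration step using the standard ibv_reg_mr() verb (not tutorial code); the protection domain 'pd' is assumed to exist already, and the returned handle carries the l_key/r_key that later work requests must present.

```c
/* Sketch only: pin and register a buffer so the HCA may access it. */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len, void **buf_out)
{
    void *buf = malloc(len);              /* step 1: choose a virtual buffer */
    if (!buf)
        return NULL;

    /* steps 2-3: kernel pins the pages; HCA caches the VA->PA mapping */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }

    *buf_out = buf;                       /* step 4: mr->lkey / mr->rkey are the handles */
    return mr;
}
```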
  • 46. • To send or receive data the l_key must be provided to the HCA • HCA verifies access to local memory • For RDMA, initiator must have the r_key for the remote virtual address • Possibly exchanged with a send/recv • r_key is not encrypted in IB CCGrid '11 Memory Protection HCA/NIC Kernel Process l_key r_key is needed for RDMA operations For security, keys are required for all operations that touch buffers 46
  • 47. CCGrid '11 Communication in the Channel Semantics (Send/Receive Model) 47 [Diagram: two hosts, each with a processor, memory segments and an InfiniBand device holding a QP (Send/Recv) and a CQ; the transfer is acknowledged in hardware]
  – Send WQE contains information about the send buffer (multiple non-contiguous segments)
  – Receive WQE contains information on the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place the data
  – Processor is involved only to: 1. Post receive WQE 2. Post send WQE 3. Pull out completed CQEs from the CQ
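With libibverbs, the exchange above looks roughly like the sketch below (not tutorial code): the receiver pre-posts a receive WQE, the sender posts a send WQE, and both sides reap CQEs. The QP, CQ and registered buffer are assumed to come from the earlier sketches, with the QP already connected.

```c
/* Sketch only: channel (two-sided) semantics with send/recv WQEs. */
#include <stdint.h>
#include <infiniband/verbs.h>

int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(qp, &wr, &bad);  /* must be posted before the message arrives */
}

int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}

int wait_one_completion(struct ibv_cq *cq)  /* pull a CQE out of the CQ */
{
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                   /* busy-poll for brevity */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```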
  • 48. CCGrid '11 Communication in the Memory Semantics (RDMA Model) 48 [Diagram: two hosts, each with a processor, memory segments and an InfiniBand device holding a QP (Send/Recv) and a CQ; the transfer is acknowledged in hardware]
  – Send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment)
  – Initiator processor is involved only to: 1. Post send WQE 2. Pull out completed CQE from the send CQ
  – No involvement from the target processor
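A matching one-sided sketch (again not tutorial code): an RDMA Write WQE names the local segment through its l_key and the remote segment through a virtual address and r_key that are assumed to have been exchanged earlier, e.g. over a send/recv.

```c
/* Sketch only: memory (one-sided) semantics via RDMA Write. */
#include <stdint.h>
#include <infiniband/verbs.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 3, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_WRITE,  /* IBV_WR_RDMA_READ pulls instead */
                              .send_flags = IBV_SEND_SIGNALED };
    wr.wr.rdma.remote_addr = remote_addr;   /* target virtual address */
    wr.wr.rdma.rkey        = rkey;          /* target's registration key */
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);    /* only the initiator reaps a CQE */
}
```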
  • 49. CCGrid '11 Communication in the Memory Semantics (Atomics) 49 [Diagram: initiator and target hosts with their InfiniBand devices; the operation (OP) is applied to the destination memory segment and the original value is returned to the source memory segment]
  – Send WQE contains information about the send buffer (single 64-bit segment) and the receive buffer (single 64-bit segment)
  – Initiator processor is involved only to: 1. Post send WQE 2. Pull out completed CQE from the send CQ
  – No involvement from the target processor
  – IB supports compare-and-swap and fetch-and-add atomic operations
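A fetch-and-add sketch along the same lines (not tutorial code): the HCA atomically adds a value to a 64-bit word at the remote address and deposits the previous value into a local 8-byte buffer.

```c
/* Sketch only: IB atomic fetch-and-add; the remote word must be 8-byte aligned. */
#include <stdint.h>
#include <infiniband/verbs.h>

int post_fetch_add(struct ibv_qp *qp, struct ibv_mr *mr, uint64_t *old_val,
                   uint64_t remote_addr, uint32_t rkey, uint64_t add)
{
    struct ibv_sge sge = { .addr = (uintptr_t)old_val, .length = 8, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 4, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_ATOMIC_FETCH_AND_ADD,
                              /* IBV_WR_ATOMIC_CMP_AND_SWP for compare-and-swap */
                              .send_flags = IBV_SEND_SIGNALED };
    wr.wr.atomic.remote_addr = remote_addr;
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = add;         /* the value to add */
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}
```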
  • 50. • InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection • Channel and memory semantics – Novel Features • Hardware Protocol Offload – Link, network and transport layer features – Subnet Management and Services CCGrid '11 IB Overview 50
  • 51. CCGrid '11 Hardware Protocol Offload Complete Hardware Implementations Exist 51
  • 52. • Buffering and Flow Control • Virtual Lanes, Service Levels and QoS • Switching and Multicast CCGrid '11 Link/Network Layer Capabilities 52
  • 53. • IB provides three levels of communication throttling/control mechanisms – Link-level flow control (link layer feature) – Message-level flow control (transport layer feature): discussed later – Congestion control (part of the link layer features) • IB provides an absolute credit-based flow-control – Receiver guarantees that enough space is allotted for N blocks of data – Occasional update of available credits by the receiver • Has no relation to the number of messages, but only to the total amount of data being sent – One 1MB message is equivalent to 1024 1KB messages (except for rounding off at message boundaries) CCGrid '11 Buffering and Flow Control 53
  • 54. • Multiple virtual links within same physical link – Between 2 and 16 • Separate buffers and flow control – Avoids Head-of-Line Blocking • VL15: reserved for management • Each port supports one or more data VL CCGrid '11 Virtual Lanes 54
  • 55. • Service Level (SL): – Packets may operate at one of 16 different SLs – Meaning not defined by IB • SL to VL mapping: – SL determines which VL on the next link is to be used – Each port (switches, routers, end nodes) has a SL to VL mapping table configured by the subnet management • Partitions: – Fabric administration (through Subnet Manager) may assign specific SLs to different partitions to isolate traffic flows CCGrid '11 Service Levels and QoS 55
  • 56. • InfiniBand Virtual Lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link • Providing the benefits of independent, separate networks while eliminating the cost and difficulties associated with maintaining two or more networks CCGrid '11 Traffic Segregation Benefits (Courtesy: Mellanox Technologies) 56 [Diagram: a single InfiniBand fabric whose Virtual Lanes carry IP network traffic (routers, switches, VPNs, DSLAMs), storage area network traffic (RAID, NAS, backup) and IPC traffic (load balancing, web caches, ASP) between servers]
  • 57. • Each port has one or more associated LIDs (Local Identifiers) – Switches look up which port to forward a packet to based on its destination LID (DLID) – This information is maintained at the switch • For multicast packets, the switch needs to maintain multiple output ports to forward the packet to – Packet is replicated to each appropriate output port – Ensures at-most once delivery & loop-free forwarding – There is an interface for a group management protocol • Create, join/leave, prune, delete group CCGrid '11 Switching (Layer-2 Routing) and Multicast 57
  • 58. • Basic unit of switching is a crossbar – Current InfiniBand products use either 24-port (DDR) or 36-port (QDR) crossbars • Switches available in the market are typically collections of crossbars within a single cabinet • Do not confuse “non-blocking switches” with “crossbars” – Crossbars provide all-to-all connectivity to all connected nodes • For any random node pair selection, all communication is non-blocking – Non-blocking switches provide a fat-tree of many crossbars • For any random node pair selection, there exists a switch configuration such that communication is non-blocking • If the communication pattern changes, the same switch configuration might no longer provide fully non-blocking communication CCGrid '11 58 Switch Complex
  • 59. • Someone has to set up the forwarding tables and give every port an LID – “Subnet Manager” does this work • Different routing algorithms give different paths CCGrid '11 IB Switching/Routing: An Example 59 [Diagram: an example IB switch block diagram (Mellanox 144-port) with leaf and spine blocks, and a forwarding table mapping DLIDs to output ports] • Switching: IB supports Virtual Cut Through (VCT) • Routing: Unspecified by IB SPEC – Up*/Down*, Shift are popular routing engines supported by OFED • Fat-Tree is a popular topology for IB Cluster – Different over-subscription ratio may be used • Other topologies are also being used – 3D Torus (Sandia Red Sky) and SGI Altix (Hypercube)
  • 60. • Similar to basic switching, except… – … sender can utilize multiple LIDs associated to the same destination port • Packets sent to one DLID take a fixed path • Different packets can be sent using different DLIDs • Each DLID can have a different path (switch can be configured differently for each DLID) • Can cause out-of-order arrival of packets – IB uses a simplistic approach: • If packets in one connection arrive out-of-order, they are dropped – Easier to use different DLIDs for different connections • This is what most high-level libraries using IB do! CCGrid '11 60 More on Multipathing
  • 62. CCGrid '11 Hardware Protocol Offload Complete Hardware Implementations Exist 62
  • 63. • Each transport service can have zero or more QPs associated with it – E.g., you can have four QPs based on RC and one QP based on UD CCGrid '11 IB Transport Services 63
  – Service Type | Connection Oriented | Acknowledged | Transport
  – Reliable Connection | Yes | Yes | IBA
  – Unreliable Connection | Yes | No | IBA
  – Reliable Datagram | No | Yes | IBA
  – Unreliable Datagram | No | No | IBA
  – Raw Datagram | No | No | Raw
  • 64. CCGrid '11 Trade-offs in Different Transport Types 64 (columns: Reliable Connection | Reliable Datagram | eXtended Reliable Connection | Unreliable Connection | Unreliable Datagram | Raw Datagram)
  – Scalability (M processes, N nodes): M²N QPs per HCA | M QPs per HCA | MN QPs per HCA | M²N QPs per HCA | M QPs per HCA | 1 QP per HCA
  – Reliability – Corrupt data detected: Yes
  – Data Delivery Guarantee: data delivered exactly once | no guarantees
  – Data Order Guarantees: per connection | one source to multiple destinations | per connection | unordered, duplicate data detected | No | No
  – Data Loss Detected: Yes | No | No
  – Error Recovery: errors (retransmissions, alternate path, etc.) handled by the transport layer, with the client only involved in handling fatal errors (links broken, protection violation, etc.) | packets with errors and sequence errors are reported to the responder | None | None
  • 65. • Data Segmentation • Transaction Ordering • Message-level Flow Control • Static Rate Control and Auto-negotiation CCGrid '11 Transport Layer Capabilities 65
  • 66. • IB transport layer provides a message-level communication granularity, not byte-level (unlike TCP) • Application can hand over a large message – Network adapter segments it to MTU sized packets – Single notification when the entire message is transmitted or received (not per packet) • Reduced host overhead to send/receive messages – Depends on the number of messages, not the number of bytes CCGrid '11 66 Data Segmentation
  • 67. • IB follows a strong transaction ordering for RC • Sender network adapter transmits messages in the order in which WQEs were posted • Each QP utilizes a single LID – All WQEs posted on same QP take the same path – All packets are received by the receiver in the same order – All receive WQEs are completed in the order in which they were posted CCGrid '11 Transaction Ordering 67
  • 68. • Also called End-to-end Flow-control – Does not depend on the number of network hops • Separate from Link-level Flow-Control – Link-level flow-control only relies on the number of bytes being transmitted, not the number of messages – Message-level flow-control only relies on the number of messages transferred, not the number of bytes • If 5 receive WQEs are posted, the sender can send 5 messages (can post 5 send WQEs) – If the sent messages are larger than the posted receive buffers, this flow control cannot handle it CCGrid '11 68 Message-level Flow-Control
  • 69. • IB allows link rates to be statically changed – On a 4X link, we can set data to be sent at 1X – For heterogeneous links, rate can be set to the lowest link rate – Useful for low-priority traffic • Auto-negotiation also available – E.g., if you connect a 4X adapter to a 1X switch, data is automatically sent at 1X rate • Only fixed settings available – Cannot set rate requirement to 3.16 Gbps, for example CCGrid '11 69 Static Rate Control and Auto-Negotiation
  • 70. • InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection • Channel and memory semantics – Novel Features • Hardware Protocol Offload – Link, network and transport layer features – Subnet Management and Services CCGrid '11 IB Overview 70
  • 71. • Agents – Processes or hardware units running on each adapter, switch, router (everything on the network) – Provide capability to query and set parameters • Managers – Make high-level decisions and implement them on the network fabric using the agents • Messaging schemes – Used for interactions between the manager and agents (or between agents) • Messages CCGrid '11 Concepts in IB Management 71
  • 73. • InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services • High-speed Ethernet Family – Internet Wide Area RDMA Protocol (iWARP) – Alternate vendor-specific protocol stacks • InfiniBand/Ethernet Convergence Technologies – Virtual Protocol Interconnect (VPI) – (InfiniBand) RDMA over Ethernet (RoE) – (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE) CCGrid '11 IB, HSE and their Convergence 73
  • 74. • High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic and Fine-grained Data Rate control – Multipathing using VLANs • Existing Implementations of HSE/iWARP – Alternate Vendor-specific Stacks • MX over Ethernet (for Myricom 10GE adapters) • Datagram Bypass Layer (for Myricom 10GE adapters) • Solarflare OpenOnload (for Solarflare 10GE adapters) CCGrid '11 HSE Overview 74
  • 75. CCGrid '11 IB and HSE RDMA Models: Commonalities and Differences 75 (columns: IB | iWARP/HSE)
  – Hardware Acceleration: Supported | Supported
  – RDMA: Supported | Supported
  – Atomic Operations: Supported | Not supported
  – Multicast: Supported | Supported
  – Congestion Control: Supported | Supported
  – Data Placement: Ordered | Out-of-order
  – Data Rate-control: Static and Coarse-grained | Dynamic and Fine-grained
  – QoS: Prioritization | Prioritization and Fixed Bandwidth QoS
  – Multipathing: Using DLIDs | Using VLANs
  • 76. • RDMA Protocol (RDMAP) – Feature-rich interface – Security Management • Remote Direct Data Placement (RDDP) – Data Placement and Delivery – Multi Stream Semantics – Connection Management • Marker PDU Aligned (MPA) – Middle Box Fragmentation – Data Integrity (CRC) CCGrid '11 iWARP Architecture and Components 76 [Diagram: application or library over the iWARP offload engines (RDMAP, RDDP, MPA) layered on TCP or SCTP and IP in the network adapter (e.g., 10GigE), with the device driver below; courtesy iWARP Specification]
  • 77. • High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic and Fine-grained Data Rate control • Existing Implementations of HSE/iWARP – Alternate Vendor-specific Stacks • MX over Ethernet (for Myricom 10GE adapters) • Datagram Bypass Layer (for Myricom 10GE adapters) • Solarflare OpenOnload (for Solarflare 10GE adapters) CCGrid '11 HSE Overview 77
  • 78. • Place data as it arrives, whether in or out-of-order • If data is out-of-order, place it at the appropriate offset • Issues from the application’s perspective: – The second half of the message having been placed does not mean that the first half has arrived as well – If one message has been placed, it does not mean that the previous messages have been placed • Issues from the protocol stack’s perspective – The receiver network stack has to understand each frame of data • If the frame is unchanged during transmission, this is easy! – The MPA protocol layer adds appropriate information at regular intervals to allow the receiver to identify fragmented frames CCGrid '11 Decoupled Data Placement and Data Delivery 78
  • 79. • Part of the Ethernet standard, not iWARP – Network vendors use a separate interface to support it • Dynamic bandwidth allocation to flows based on interval between two packets in a flow – E.g., one stall for every packet sent on a 10 Gbps network refers to a bandwidth allocation of 5 Gbps – Complicated because of TCP windowing behavior • Important for high-latency/high-bandwidth networks – Large windows exposed on the receiver side – Receiver overflow controlled through rate control CCGrid '11 Dynamic and Fine-grained Rate Control 79
  • 80. • Can allow for simple prioritization: – E.g., connection 1 performs better than connection 2 – 8 classes provided (a connection can be in any class) • Similar to SLs in InfiniBand – Two priority classes for high-priority traffic • E.g., management traffic or your favorite application • Or can allow for specific bandwidth requests: – E.g., can request for 3.62 Gbps bandwidth – Packet pacing and stalls used to achieve this • Query functionality to find out “remaining bandwidth” CCGrid '11 Prioritization and Fixed Bandwidth QoS 80
  • 81. • High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic and Fine-grained Data Rate control • Existing Implementations of HSE/iWARP – Alternate Vendor-specific Stacks • MX over Ethernet (for Myricom 10GE adapters) • Datagram Bypass Layer (for Myricom 10GE adapters) • Solarflare OpenOnload (for Solarflare 10GE adapters) CCGrid '11 HSE Overview 81
  • 82. • Regular Ethernet adapters and TOEs are fully compatible • Compatibility with iWARP required • Software iWARP emulates the functionality of iWARP on the host – Fully compatible with hardware iWARP – Internally utilizes host TCP/IP stack CCGrid '11 Software iWARP based Compatibility 82 Wide Area Network Distributed Cluster Environment Regular Ethernet Cluster iWARP Cluster TOE Cluster
  • 83. CCGrid '11 Different iWARP Implementations Regular Ethernet Adapters Application High Performance Sockets Sockets Network Adapter TCP IP Device Driver Offloaded TCP Offloaded IP Software iWARP TCP Offload Engines Application High Performance Sockets Sockets Network Adapter TCP IP Device Driver Offloaded TCP Offloaded IP Offloaded iWARP iWARP compliant Adapters Application Kernel-level iWARP TCP (Modified with MPA) IP Device Driver Network Adapter Sockets Application User-level iWARP IP Sockets TCP Device Driver Network Adapter OSU, OSC, IBM OSU, ANL Chelsio, NetEffect (Intel) 83
  • 84. • High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic and Fine-grained Data Rate control – Multipathing using VLANs • Existing Implementations of HSE/iWARP – Alternate Vendor-specific Stacks • MX over Ethernet (for Myricom 10GE adapters) • Datagram Bypass Layer (for Myricom 10GE adapters) • Solarflare OpenOnload (for Solarflare 10GE adapters) CCGrid '11 HSE Overview 84
  • 85. • Proprietary communication layer developed by Myricom for their Myrinet adapters – Third generation communication layer (after FM and GM) – Supports Myrinet-2000 and the newer Myri-10G adapters • Low-level “MPI-like” messaging layer – Almost one-to-one match with MPI semantics (including connectionless model, implicit memory registration and tag matching) – Later versions added some more advanced communication methods such as RDMA to support other programming models such as ARMCI (low-level runtime for the Global Arrays PGAS library) • Open-MX – New open-source implementation of the MX interface for non-Myrinet adapters from INRIA, France CCGrid '11 85 Myrinet Express (MX)
  • 86. • Another proprietary communication layer developed by Myricom – Compatible with regular UDP sockets (embraces and extends) – Idea is to bypass the kernel stack and give UDP applications direct access to the network adapter • High performance and low-jitter • Primary motivation: Financial market applications (e.g., stock market) – Applications prefer unreliable communication – Timeliness is more important than reliability • This stack is covered by NDA; more details can be requested from Myricom CCGrid '11 86 Datagram Bypass Layer (DBL)
  • 87. CCGrid '11 Solarflare Communications: OpenOnload Stack 87 [Diagrams: typical commodity networking stack, typical HPC networking stack, and the Solarflare approach to the networking stack] • HPC Networking Stack provides many performance benefits, but has limitations for certain types of scenarios, especially where applications tend to fork(), exec() and need asynchronous advancement (per application) • Solarflare approach: – Network hardware provides a user-safe interface to route packets directly to apps based on flow information in headers – Protocol processing can happen in both kernel and user space – Protocol state shared between app and kernel using shared memory Courtesy Solarflare Communications (www.openonload.org/openonload-google-talk.pdf)
  • 88. • InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services • High-speed Ethernet Family – Internet Wide Area RDMA Protocol (iWARP) – Alternate vendor-specific protocol stacks • InfiniBand/Ethernet Convergence Technologies – Virtual Protocol Interconnect (VPI) – (InfiniBand) RDMA over Ethernet (RoE) – (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE) CCGrid '11 IB, HSE and their Convergence 88
  • 89. • Single network firmware to support both IB and Ethernet • Autosensing of layer-2 protocol – Can be configured to automatically work with either IB or Ethernet networks • Multi-port adapters can use one port on IB and another on Ethernet • Multiple use modes: – Datacenters with IB inside the cluster and Ethernet outside – Clusters with IB network and Ethernet management CCGrid '11 Virtual Protocol Interconnect (VPI) IB Link Layer IB Port Ethernet Port Hardware TCP/IP support Ethernet Link Layer IB Network Layer IP IB Transport Layer TCP IB Verbs Sockets Applications 89
  • 90. • Native convergence of IB network and transport layers with Ethernet link layer • IB packets encapsulated in Ethernet frames • IB network layer already uses IPv6 frames • Pros: – Works natively in Ethernet environments (entire Ethernet management ecosystem is available) – Has all the benefits of IB verbs • Cons: – Network bandwidth might be limited to Ethernet switches: 10GE switches available; 40GE yet to arrive; 32 Gbps IB available – Some IB native link-layer features are optional in (regular) Ethernet • Approved by OFA board to be included into OFED CCGrid '11 (InfiniBand) RDMA over Ethernet (IBoE or RoE) Ethernet IB Network IB Transport IB Verbs Application Hardware 90
  • 91. • Very similar to IB over Ethernet – Often used interchangeably with IBoE – Can be used to explicitly specify link layer is Converged (Enhanced) Ethernet (CE) • Pros: – Works natively in Ethernet environments (entire Ethernet management ecosystem is available) – Has all the benefits of IB verbs – CE is very similar to the link layer of native IB, so there are no missing features • Cons: – Network bandwidth might be limited to Ethernet switches: 10GE switches available; 40GE yet to arrive; 32 Gbps IB available CCGrid '11 (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE) CE IB Network IB Transport IB Verbs Application Hardware 91
  • 92. CCGrid '11 IB and HSE: Feature Comparison 92 (columns: IB | iWARP/HSE | RoE | RoCE)
  – Hardware Acceleration: Yes | Yes | Yes | Yes
  – RDMA: Yes | Yes | Yes | Yes
  – Congestion Control: Yes | Optional | Optional | Yes
  – Multipathing: Yes | Yes | Yes | Yes
  – Atomic Operations: Yes | No | Yes | Yes
  – Multicast: Optional | No | Optional | Optional
  – Data Placement: Ordered | Out-of-order | Ordered | Ordered
  – Prioritization: Optional | Optional | Optional | Yes
  – Fixed BW QoS (ETS): No | Optional | Optional | Yes
  – Ethernet Compatibility: No | Yes | Yes | Yes
  – TCP/IP Compatibility: Yes (using IPoIB) | Yes | Yes (using IPoIB) | Yes (using IPoIB)
  • 93. • Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installations • Sample Case Studies and Performance Numbers • Conclusions and Final Q&A CCGrid '11 Presentation Overview 93
  • 94. • Many IB vendors: Mellanox+Voltaire and Qlogic – Aligned with many server vendors: Intel, IBM, SUN, Dell – And many integrators: Appro, Advanced Clustering, Microway • Broadly two kinds of adapters – Offloading (Mellanox) and Onloading (Qlogic) • Adapters with different interfaces: – Dual port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT • MemFree Adapter – No memory on HCA → Uses System memory (through PCIe) – Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB) • Different speeds – SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps) • Some 12X SDR adapters exist as well (24 Gbps each way) • New ConnectX-2 adapter from Mellanox supports offload for collectives (Barrier, Broadcast, etc.) CCGrid '11 IB Hardware Products 94
  • 95. CCGrid '11 Tyan Thunder S2935 Board (Courtesy Tyan) Similar boards from Supermicro with LOM features are also available 95
  • 96. • Customized adapters to work with IB switches – Cray XD1 (formerly by Octigabay), Cray CX1 • Switches: – 4X SDR and DDR (8-288 ports); 12X SDR (small sizes) – 3456-port “Magnum” switch from SUN → used at TACC • 72-port “nano magnum” – 36-port Mellanox InfiniScale IV QDR switch silicon in 2008 • Up to 648-port QDR switch by Mellanox and SUN • Some internal ports are 96 Gbps (12X QDR) – IB switch silicon from Qlogic introduced at SC ’08 • Up to 846-port QDR switch by Qlogic – New FDR (56 Gbps) switch silicon (Bridge-X) has been announced by Mellanox in May ‘11 • Switch Routers with Gateways – IB-to-FC; IB-to-IP CCGrid '11 IB Hardware Products (contd.) 96
  • 97. • 10GE adapters: Intel, Myricom, Mellanox (ConnectX) • 10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel) • 40GE adapters: Mellanox ConnectX2-EN 40G • 10GE switches – Fulcrum Microsystems • Low latency switch based on 24-port silicon • FM4000 switch with IP routing, and TCP/UDP support – Fujitsu, Myricom (512 ports), Force10, Cisco, Arista (formerly Arastra) • 40GE and 100GE switches – Nortel Networks • 10GE downlinks with 40GE and 100GE uplinks – Broadcom has announced 40GE switch in early 2010 CCGrid '11 10G, 40G and 100G Ethernet Products 97
  • 98. • Mellanox ConnectX Adapter • Supports IB and HSE convergence • Ports can be configured to support IB or HSE • Support for VPI and RoCE – 8 Gbps (SDR), 16Gbps (DDR) and 32Gbps (QDR) rates available for IB – 10GE rate available for RoCE – 40GE rate for RoCE is expected to be available in near future CCGrid '11 Products Providing IB and HSE Convergence 98
  • 99. • Open source organization (formerly OpenIB) – www.openfabrics.org • Incorporates both IB and iWARP in a unified manner – Support for Linux and Windows – Design of complete stack with `best of breed’ components • Gen1 • Gen2 (current focus) • Users can download the entire stack and run – Latest release is OFED 1.5.3 – OFED 1.6 is underway CCGrid '11 Software Convergence with OpenFabrics 99
  • 100. CCGrid '11 OpenFabrics Stack with Unified Verbs Interface Verbs Interface (libibverbs) Mellanox (libmthca) QLogic (libipathverbs) IBM (libehca) Chelsio (libcxgb3) Mellanox (ib_mthca) QLogic (ib_ipath) IBM (ib_ehca) Chelsio (ib_cxgb3) User Level Kernel Level Mellanox Adapters QLogic Adapters Chelsio Adapters IBM Adapters 100
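One practical consequence of the unified verbs interface is that the same application code enumerates and opens Mellanox, QLogic, IBM or Chelsio devices alike, with the vendor-specific user library loaded underneath. A minimal sketch (not from the tutorial):

```c
/* Sketch only: device discovery through the unified libibverbs API. */
#include <stdio.h>
#include <infiniband/verbs.h>

int list_rdma_devices(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs)
        return -1;
    for (int i = 0; i < n; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return n;   /* number of RDMA-capable devices found */
}
```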
  • 101. • For IBoE and RoCE, the upper-level stacks remain completely unchanged • Within the hardware: – Transport and network layers remain completely unchanged – Both IB and Ethernet (or CEE) link layers are supported on the network adapter • Note: The OpenFabrics stack is not valid for the Ethernet path in VPI – That still uses sockets and TCP/IP CCGrid '11 OpenFabrics on Convergent IB/HSE Verbs Interface (libibverbs) ConnectX (libmlx4) ConnectX (ib_mlx4) User Level Kernel Level ConnectX Adapters HSE IB 101
  • 102. CCGrid '11 OpenFabrics Software Stack 102
  – Key: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; PMA = Performance Manager Agent; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (Initiator); iSER = iSCSI RDMA Protocol (Initiator); RDS = Reliable Datagram Service; UDAPL = User Direct Access Programming Lib; HCA = Host Channel Adapter; R-NIC = RDMA NIC
  [Diagram: the OpenFabrics stack across user and kernel space — apps & access methods (various MPIs, clustered DB access, sockets-based access, access to file systems, block storage access, IP-based app access), user-level APIs (user-level MAD API, user-level verbs/API for InfiniBand and iWARP, SDP lib, UDAPL, OpenSM, diag tools), upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems), mid-layer components (connection manager abstraction, SA client, MAD, SMA, kernel-level verbs/API), and hardware-specific drivers for InfiniBand HCAs and iWARP R-NICs, with kernel-bypass data paths]
  • 103. CCGrid '11 103 InfiniBand in the Top500 Percentage share of InfiniBand is steadily increasing
  • 104. CCGrid '11 104 InfiniBand in the Top500 (Nov. 2010)
  – Number of Systems: Gigabit Ethernet 45%, InfiniBand 43%, Proprietary 6%, Myrinet 1%, NUMAlink 1%, Custom 4%, others (Quadrics, Mixed, SP Switch, Cray Interconnect, Fat Tree) ~0% each
  – Performance: Gigabit Ethernet 25%, InfiniBand 44%, Proprietary 21%, Myrinet 1%, Custom 9%, others (Quadrics, Mixed, NUMAlink, SP Switch, Cray Interconnect, Fat Tree) ~0% each
  • 105. 105 InfiniBand System Efficiency in the Top500 List CCGrid '11 [Chart: computer cluster efficiency comparison — efficiency (%) vs. Top 500 system rank, for IB-CPU, IB-GPU/Cell, GigE, 10GigE, IBM-BlueGene and Cray systems]
  • 106. • 214 IB Clusters (42.8%) in the Nov ‘10 Top500 list (http://www.top500.org) • Installations in the Top 30 (13 systems): CCGrid '11 Large-scale InfiniBand Installations 106
  – 120,640 cores (Nebulae) in China (3rd)
  – 73,278 cores (Tsubame-2.0) in Japan (4th)
  – 138,368 cores (Tera-100) in France (6th)
  – 122,400 cores (RoadRunner) at LANL (7th)
  – 81,920 cores (Pleiades) at NASA Ames (11th)
  – 42,440 cores (Red Sky) at Sandia (14th)
  – 62,976 cores (Ranger) at TACC (15th)
  – 35,360 cores (Lomonosov) in Russia (17th)
  – 15,120 cores (Loewe) in Germany (22nd)
  – 26,304 cores (Juropa) in Germany (23rd)
  – 26,232 cores (TachyonII) in South Korea (24th)
  – 23,040 cores (Jade) at GENCI (27th)
  – 33,120 cores (Mole-8.5) in China (28th)
  More are getting installed!
  • 107. • HSE compute systems with ranking in the Nov 2010 Top500 list – 8,856-core installation at Purdue with ConnectX-EN 10GigE (#126) – 7,944-core installation at Purdue with 10GigE Chelsio/iWARP (#147) – 6,828-core installation in Germany (#166) – 6,144-core installation in Germany (#214) – 6,144-core installation in Germany (#215) – 7,040-core installation at the Amazon EC2 Cluster (#231) – 4,000-core installation at HHMI (#349) • Other small clusters – 640-core installation at the University of Heidelberg, Germany – 512-core installation at Sandia National Laboratory (SNL) with Chelsio/iWARP and Woven Systems switch – 256-core installation at Argonne National Lab with Myri-10G • Integrated Systems – BG/P uses 10GE for I/O (ranks 9, 13, 16, and 34 in the Top 50) CCGrid '11 107 HSE Scientific Computing Installations
  • 108. • HSE has most of its popularity in enterprise computing and other non-scientific markets including Wide-area networking • Example Enterprise Computing Domains – Enterprise Datacenters (HP, Intel) – Animation firms (e.g., Universal Studios (“The Hulk”), 20th Century Fox (“Avatar”), and many new movies using 10GE) – Amazon’s HPC cloud offering uses 10GE internally – Heavily used in financial markets (users are typically undisclosed) • Many Network-attached Storage devices come integrated with 10GE network adapters • ESnet to install $62M 100GE infrastructure for US DOE CCGrid '11 108 Other HSE Installations
  • 109. • Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installations • Sample Case Studies and Performance Numbers • Conclusions and Final Q&A CCGrid '11 Presentation Overview 109
  • 110. Modern Interconnects and Protocols 110 [Diagram: protocol options compared by application interface (Sockets vs. Verbs), protocol implementation (kernel-space TCP/IP, hardware-offloaded TCP/IP, IPoIB, SDP, user-space RDMA/IB Verbs), network adapter and network switch — covering the 1/10 GigE, 10 GigE-TOE, IPoIB, SDP and native IB Verbs (RDMA) paths over Ethernet and InfiniBand adapters and switches]
  • 111. • Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • InfiniBand in WAN and Grid-FTP • Cloud Computing: Hadoop and Memcached CCGrid '11 111 Case Studies
  • 112. CCGrid '11 112 Low-level Latency Measurements [Charts: small-message and large-message latency (us) vs. message size for VPI-IB, Native IB, VPI-Eth and RoCE] ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches. RoCE has a slight overhead compared to native IB because it operates at a slower clock rate (required to support only a 10Gbps link for Ethernet, as compared to a 32Gbps link for IB)
  • 113. CCGrid '11 113 Low-level Uni-directional Bandwidth Measurements [Chart: uni-directional bandwidth (MBps) vs. message size for VPI-IB, Native IB, VPI-Eth and RoCE] ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
  • 114. • Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • InfiniBand in WAN and Grid-FTP • Cloud Computing: Hadoop and Memcached CCGrid '11 114 Case Studies
  • 115. • High Performance MPI Library for IB and HSE – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2) – Used by more than 1,550 organizations in 60 countries – More than 66,000 downloads from OSU site directly – Empowering many TOP500 clusters • 11th ranked 81,920-core cluster (Pleiades) at NASA • 15th ranked 62,976-core cluster (Ranger) at TACC – Available with software stacks of many IB, HSE and server vendors including Open Fabrics Enterprise Distribution (OFED) and Linux Distros (RedHat and SuSE) – http://mvapich.cse.ohio-state.edu CCGrid '11 115 MVAPICH/MVAPICH2 Software
  • 116. CCGrid '11 116 One-way Latency: MPI over IB [Charts: small- and large-message latency (us) vs. message size for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR-PCIe2, MVAPICH-ConnectX-DDR and MVAPICH-ConnectX-QDR-PCIe2; small-message latencies of 1.54, 1.60, 1.96 and 2.17 us across the four configurations] All numbers taken on 2.4 GHz Quad-core (Nehalem) Intel with IB switch
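Latency numbers like these are typically obtained with ping-pong microbenchmarks such as the OSU suite; the sketch below (not the benchmark code itself) shows the basic shape of such a measurement when run with two MPI ranks.

```c
/* Sketch only: two-rank ping-pong; reports half the average round-trip time. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    enum { ITERS = 10000, SIZE = 8 };
    char buf[SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));
    MPI_Finalize();
    return 0;
}
```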
  • 117. CCGrid '11 117 Bandwidth: MPI over IB [Charts: unidirectional and bidirectional bandwidth (Million Bytes/sec) vs. message size for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR-PCIe2, MVAPICH-ConnectX-DDR and MVAPICH-ConnectX-QDR-PCIe2; peak unidirectional bandwidths of 1553.2, 1901.1, 2665.6 and 3023.7 MB/s, and peak bidirectional bandwidths of 2990.1, 3244.1, 3642.8 and 5835.7 MB/s across the four configurations] All numbers taken on 2.4 GHz Quad-core (Nehalem) Intel with IB switch
  • 118. CCGrid '11 118 One-way Latency: MPI over iWARP [Chart: one-way latency (us) vs. message size for Chelsio (TCP/IP), Chelsio (iWARP), Intel-NetEffect (TCP/IP) and Intel-NetEffect (iWARP); small-message latencies of 6.25, 6.88, 15.47 and 25.43 us across the four configurations] 2.4 GHz Quad-core Intel (Clovertown) with 10GE (Fulcrum) Switch
  • 119. CCGrid '11 119 Bandwidth: MPI over iWARP [Charts: unidirectional and bidirectional bandwidth (Million Bytes/sec) vs. message size for Chelsio (TCP/IP), Chelsio (iWARP), Intel-NetEffect (TCP/IP) and Intel-NetEffect (iWARP); peak unidirectional bandwidths of 373.3, 839.8, 1169.7 and 1245.0 MB/s, and peak bidirectional bandwidths of 647.1, 855.3, 2029.5 and 2260.8 MB/s across the four configurations] 2.33 GHz Quad-core Intel (Clovertown) with 10GE (Fulcrum) Switch
  • 120. CCGrid '11 120 Convergent Technologies: MPI Latency [Charts: small- and large-message latency (us) vs. message size for Native IB, VPI-IB, VPI-Eth and RoCE; small-message latencies of 1.8, 2.2, 3.6 and 37.65 us across the four configurations] ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
  • 121. CCGrid '11 121 Convergent Technologies: MPI Uni- and Bi-directional Bandwidth [Charts: uni- and bi-directional bandwidth (MBps) vs. message size for Native IB, VPI-IB, VPI-Eth and RoCE; peak uni-directional bandwidths of 404.1, 1115.7, 1518.1 and 1518.1 MBps, and peak bi-directional bandwidths of 1085.5, 2114.9, 2880.4 and 2825.6 MBps across the four configurations] ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
  • 122. • Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • InfiniBand in WAN and Grid-FTP • Cloud Computing: Hadoop and Memcached CCGrid '11 122 Case Studies
  • 123. CCGrid '11 123 IPoIB vs. SDP Architectural Models Traditional Model Possible SDP Model Sockets App Sockets API Sockets Application Sockets API Kernel TCP/IP Sockets Provider TCP/IP Transport Driver IPoIB Driver User InfiniBand CA Kernel TCP/IP Sockets Provider TCP/IP Transport Driver IPoIB Driver User InfiniBand CA Sockets Direct Protocol Kernel Bypass RDMA Semantics SDP OS Modules InfiniBand Hardware (Source: InfiniBand Trade Association)
  • 124. CCGrid '11 124 SDP vs. IPoIB (IB QDR) [Charts: latency (us), bandwidth (MBps) and bidirectional bandwidth (MBps) vs. message size for IPoIB-RC, IPoIB-UD and SDP] SDP enables high bandwidth (up to 15 Gbps), low latency (6.6 µs)
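Part of SDP's appeal is that unmodified sockets code can be redirected onto it, e.g. by preloading the OFED libsdp library over an ordinary TCP client such as the sketch below (the address and port are placeholders, not from the tutorial).

```c
/* Sketch only: a plain TCP client; run e.g. as
 *   LD_PRELOAD=libsdp.so ./client
 * so that AF_INET stream sockets are mapped onto SDP (assumes OFED libsdp). */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);            /* unmodified sockets code */
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(5001) };
    inet_pton(AF_INET, "192.168.1.10", &addr.sin_addr);  /* placeholder server */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }
    const char msg[] = "hello over SDP (or plain TCP)";
    write(fd, msg, sizeof(msg));
    close(fd);
    return 0;
}
```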
  • 125. • Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • InfiniBand in WAN and Grid-FTP • Cloud Computing: Hadoop and Memcached CCGrid '11 125 Case Studies
  • 126. • Option 1: Layer-1 Optical networks – IB standard specifies link, network and transport layers – Can use any layer-1 (though the standard says copper and optical) • Option 2: Link Layer Conversion Techniques – InfiniBand-to-Ethernet conversion at the link layer: switches available from multiple companies (e.g., Obsidian) • Technically, it’s not conversion; it’s just tunneling (L2TP) – InfiniBand’s network layer is IPv6 compliant CCGrid '11 126 IB on the WAN
  • 127. Features • End-to-end guaranteed bandwidth channels • Dynamic, in-advance, reservation and provisioning of fractional/full lambdas • Secure control-plane for signaling • Peering with ESnet, National Science Foundation CHEETAH, and other networks CCGrid '11 127 UltraScience Net: Experimental Research Network Testbed Since 2005 This and the following IB WAN slides are courtesy Dr. Nagi Rao (ORNL)
  • 128. • Supports SONET OC-192 or 10GE LAN-PHY/WAN-PHY • Idea is to make remote storage “appear” local • IB-WAN switch does frame conversion – IB standard allows per-hop credit-based flow control – IB-WAN switch uses large internal buffers to allow enough credits to fill the wire CCGrid '11 128 IB-WAN Connectivity with Obsidian Switches CPU CPU System Controller System Memory HCA Host bus Storage Wide-area SONET/10GE IB switch IB WAN
  • 129. CCGrid '11 129 InfiniBand Over SONET: Obsidian Longbows — RDMA throughput measurements over USN [Diagram: Linux hosts at ORNL connected through Longbow IB/S units and OC192 CDCIs at ORNL, Chicago (700 miles), Seattle (3300 miles) and Sunnyvale (4300 miles)]
  – ORNL loop (0.2 mile): 7.48 Gbps (IB 4X: 8 Gbps full speed; host-to-host via local switch: 7.5 Gbps IB 4X)
  – ORNL-Chicago loop (1400 miles): 7.47 Gbps
  – ORNL-Chicago-Seattle loop (6600 miles): 7.37 Gbps
  – ORNL-Chicago-Seattle-Sunnyvale loop (8600 miles): 7.34 Gbps
  Hosts: dual-socket quad-core 2 GHz AMD Opteron, 4GB memory, 8-lane PCI-Express slot, dual-port Voltaire 4X SDR HCA
  • 130. CCGrid '11 130 IB over 10GE LAN-PHY and WAN-PHY [Diagram: Linux hosts at ORNL connected through Longbow IB/S units, ORNL/Chicago/Seattle/Sunnyvale CDCIs and E300s over OC192 and 10GE WAN-PHY links (700, 3300 and 4300 miles); dual-socket quad-core hosts]
  – ORNL loop (0.2 mile): 7.5 Gbps IB 4X
  – ORNL-Chicago loop (1400 miles): 7.49 Gbps
  – ORNL-Chicago-Seattle loop (6600 miles): 7.39 Gbps
  – ORNL-Chicago-Seattle-Sunnyvale loop (8600 miles): 7.36 Gbps
  • 131. MPI over IB-WAN: Obsidian Routers [Setup: Cluster A and Cluster B connected over a WAN link through Obsidian WAN routers with variable emulated delay; delay-to-distance mapping: 10 us ≈ 2 km, 100 us ≈ 20 km, 1000 us ≈ 200 km, 10000 us ≈ 2000 km] [Charts: MPI bidirectional bandwidth, and impact of encryption on message rate (delay 0 ms)] Hardware encryption has no impact on performance for a small number of communicating streams 131 CCGrid '11 S. Narravula, H. Subramoni, P. Lai, R. Noronha and D. K. Panda, Performance of HPC Middleware over InfiniBand WAN, Int'l Conference on Parallel Processing (ICPP '08), September 2008.
  • 132. Communication Options in Grid • Multiple options exist to perform data transfer on Grid • Globus-XIO framework currently does not support IB natively • We create the Globus-XIO ADTS driver and add native IB support to GridFTP Globus XIO Framework GridFTP High Performance Computing Applications 10 GigE Network IB Verbs IPoIB RoCE TCP/IP Obsidian Routers 132 CCGrid '11
  • 133. Globus-XIO Framework with ADTS Driver [Diagram: the Globus XIO interface dispatching from the user and file system to XIO drivers #1..#n and to the Globus-XIO ADTS driver, whose components include data connection management, persistent session management, buffer & file management, network flow control, zero copy channel, memory registration and a data transport interface over modern WAN interconnects (InfiniBand/RoCE, 10GigE/iWARP)] 133 CCGrid '11
  • 134. 134 Performance of Memory Based Data Transfer • Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files • ADTS based implementation is able to saturate the link bandwidth • Best performance for ADTS obtained when performing data transfer with a network buffer of size 32 MB [Charts: bandwidth (MBps) vs. network delay (us) for the ADTS and UDT drivers with 2 MB, 8 MB, 32 MB and 64 MB buffers] CCGrid '11
  • 135. 135 Performance of Disk Based Data Transfer • Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files • Predictable as well as better performance when Disk-IO threads assist the network thread (Asynchronous Mode) • Best performance for ADTS obtained with a circular buffer with individual buffers of size 64 MB [Charts: bandwidth (MBps) vs. network delay (us) in synchronous and asynchronous modes for ADTS-8MB, ADTS-16MB, ADTS-32MB, ADTS-64MB and IPoIB-64MB] CCGrid '11
  • 136. 136 Application Level Performance [Chart: bandwidth (MBps) for the CCSM and Ultra-Viz target applications with ADTS and IPoIB] • Application performance for FTP get operation for disk-based transfers • Community Climate System Model (CCSM) – Part of Earth System Grid Project – Transfers 160 TB of total data in chunks of 256 MB – Network latency: 30 ms • Ultra-Scale Visualization (Ultra-Viz) – Transfers files of size 2.6 GB – Network latency: 80 ms • The ADTS driver outperforms the UDT driver using IPoIB by more than 100% H. Subramoni, P. Lai, R. Kettimuthu and D. K. Panda, High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand, Int'l Symposium on Cluster Computing and the Grid (CCGrid), May 2010. 136 CCGrid '11
  • 137. • Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • InfiniBand in WAN and Grid-FTP • Cloud Computing: Hadoop and Memcached CCGrid '11 137 Case Studies
  • 138. A New Approach towards OFA in Cloud Current Approach Towards OFA in Cloud Application Accelerated Sockets 10 GigE or InfiniBand Verbs / Hardware Offload Current Cloud Software Design Application Sockets 1/10 GigE Network Our Approach Towards OFA in Cloud Application Verbs Interface 10 GigE or InfiniBand • Sockets not designed for high-performance – Stream semantics are often a mismatch for upper layers (Memcached, Hadoop) – Zero-copy not available for non-blocking sockets (Memcached) • Significant consolidation in cloud system software – Hadoop and Memcached are developer-facing APIs, not sockets – Improving Hadoop and Memcached will benefit many applications immediately! CCGrid '11 138
  • 139. Memcached Design Using Verbs • Server and client perform a negotiation protocol – Master thread assigns clients to appropriate worker thread • Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread • All other Memcached data structures are shared among RDMA and Sockets worker threads Sockets Client RDMA Client Master Thread Sockets Worker Thread Verbs Worker Thread Sockets Worker Thread Verbs Worker Thread Shared Data Memory Slabs Items … 1 1 2 2 CCGrid '11 139
  • 140. Memcached Get Latency • Memcached Get latency – 4 bytes – DDR: 6 us; QDR: 5 us – 4K bytes – DDR: 20 us; QDR: 12 us • Almost a factor of four improvement over 10GE (TOE) for 4K • We are in the process of evaluating iWARP on 10GE [Charts: Get time (us) vs. message size on the Intel Clovertown cluster (IB: DDR) and Intel Westmere cluster (IB: QDR), comparing SDP, IPoIB, 1G, 10G-TOE and the OSU design] CCGrid '11 140
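For reference, a Get-latency loop of this general shape can be written against the stock libmemcached client API; the sketch below is only illustrative (server address, key and value size are placeholders) and is not the OSU-modified client used for the numbers above.

```c
/* Sketch only: average memcached Get latency over the standard client library. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <libmemcached/memcached.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);   /* placeholder server */

    const char *key = "k";
    char value[4096];
    memset(value, 'a', sizeof(value));
    memcached_set(memc, key, strlen(key), value, sizeof(value), 0, 0);

    enum { ITERS = 10000 };
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) {
        size_t len; uint32_t flags; memcached_return_t rc;
        char *v = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        free(v);                                      /* caller owns the returned copy */
    }
    gettimeofday(&t1, NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average Get latency: %.2f us\n", us / ITERS);
    memcached_free(memc);
    return 0;
}
```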
  • 141. Memcached Get TPS • Memcached Get transactions per second for 4 bytes – On IB DDR about 600K/s for 16 clients – On IB QDR 1.9M/s for 16 clients • Almost a factor of six improvement over 10GE (TOE) • We are in the process of evaluating iWARP on 10GE [Charts: thousands of transactions per second (TPS) for 8 and 16 clients on the Intel Clovertown cluster (IB: DDR) and Intel Westmere cluster (IB: QDR), comparing SDP, IPoIB, 10G-TOE and the OSU design] CCGrid '11 141
  • 142. Hadoop: Java Communication Benchmark • Sockets-level ping-pong bandwidth test • Java performance depends on usage of NIO (allocateDirect) • C and Java versions of the benchmark have similar performance • HDFS does not use direct allocated blocks or NIO on DataNode S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda, “Can High-Performance Interconnects Benefit Hadoop Distributed File System?”, MASVDC '10 in conjunction with MICRO 2010, Atlanta, GA. [Charts: bandwidth (MB/s) vs. message size for the C and Java versions of the benchmark, comparing 1GE, IPoIB, SDP, 10GE-TOE and SDP-NIO] CCGrid '11 142
  • 143. Hadoop: DFS IO Write Performance • DFS IO included in Hadoop, measures sequential access throughput • We have two map tasks each writing to a file of increasing size (1-10GB) • Significant improvement with IPoIB, SDP and 10GigE • With SSD, performance improvement is almost seven- or eight-fold! • SSD benefits not seen without using a high-performance interconnect! • In line with the comment in the Google keynote about I/O performance [Chart: average write throughput (MB/sec) vs. file size (GB) on four DataNodes, for 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD] CCGrid '11 143
  • 144. Hadoop: RandomWriter Performance • Each map generates 1GB of random binary data and writes to HDFS • SSD improves execution time by 50% with 1GigE for two DataNodes • For four DataNodes, benefits are observed only with an HPC interconnect • IPoIB, SDP and 10GigE can improve performance by 59% on four DataNodes [Chart: execution time (sec) for two and four DataNodes, for 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD] CCGrid '11 144
  • 145. Hadoop Sort Benchmark • Sort: baseline benchmark for Hadoop • Sort phase: I/O bound; Reduce phase: communication bound • SSD improves performance by 28% using 1GigE with two DataNodes • Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE [Chart: execution time (sec) for two and four DataNodes, for 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD] CCGrid '11 145
  • 146. • Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installations • Sample Case Studies and Performance Numbers • Conclusions and Final Q&A CCGrid '11 Presentation Overview 146
  • 147. • Presented network architectures & trends for Clusters, Grid, Multi-tier Datacenters and Cloud Computing Systems • Presented background and details of IB and HSE – Highlighted the main features of IB and HSE and their convergence – Gave an overview of IB and HSE hardware/software products – Discussed sample performance numbers in designing various high- end systems with IB and HSE • IB and HSE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions CCGrid '11 Concluding Remarks 147
  • 148. CCGrid '11 Funding Acknowledgments Funding Support by Equipment Support by 148
  • 149. CCGrid '11 Personnel Acknowledgments Current Students – N. Dandapanthula (M.S.) – R. Darbha (M.S.) – V. Dhanraj (M.S.) – J. Huang (Ph.D.) – J. Jose (Ph.D.) – K. Kandalla (Ph.D.) – M. Luo (Ph.D.) – V. Meshram (M.S.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – S. Potluri (Ph.D.) – R. Rajachandrasekhar (Ph.D.) – A. Singh (Ph.D.) – H. Subramoni (Ph.D.) Past Students – P. Balaji (Ph.D.) – D. Buntinas (Ph.D.) – S. Bhagvat (M.S.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – W. Huang (Ph.D.) – W. Jiang (M.S.) – S. Kini (M.S.) – M. Koop (Ph.D.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – P. Lai (Ph. D.) – J. Liu (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – G. Santhanaraman (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) 149 Current Research Scientist – S. Sur Current Post-Docs – H. Wang – J. Vienne – X. Besseron Current Programmers – M. Arnold – D. Bureddy – J. Perkins Past Post-Docs – E. Mancini – S. Marcarelli – H.-W. Jin
  • 150. CCGrid '11 Web Pointers http://www.cse.ohio-state.edu/~panda http://www.cse.ohio-state.edu/~surs http://nowlab.cse.ohio-state.edu MVAPICH Web Page http://mvapich.cse.ohio-state.edu panda@cse.ohio-state.edu surs@cse.ohio-state.edu 150