InfiniBand Essentials Every HPC Expert Must Know
Oded Paz
April 2014
 IB Principles
• Targets
• Fabric Components
• Fabric Architecture
 Fabric Discovery Stages
• Topology Discovery
• Information Gathering
• Forwarding Tables
• Fabric SDN
• Fabric Activation
 Protocol Layer Principles
• Supported Upper Layer Protocols
• Transport Layer
• Link Layer
• Physical Layer
 Mellanox Products
• InfiniBand Switches
• Channel Adapters
• Cabling
• Fabric Management
Leading Supplier of End-to-End Interconnect Solutions
[Diagram: the comprehensive end-to-end InfiniBand and Ethernet portfolio – ICs, adapter cards, switches/gateways, cables, and host/fabric software – spanning server/compute, switch/gateway, and storage front/back-end, with Virtual Protocol Interconnect supporting 56G InfiniBand, 56G IB & FCoIB, 10/40/56GbE, FCoE, and Fibre Channel.]
Mellanox Common Target Implementations
Mellanox interconnects target HPC, Web 2.0, database/enterprise, cloud, storage, and financial services deployments. Headline results cited across these verticals:
• Up to 10X performance and simulation runtime
• 33% higher GPU performance
• Unlimited scalability, lowest latency
• 62% better execution time
• 42% faster messages per second
• 12X more throughput
• Support for more users at higher bandwidth; improved and guaranteed SLAs
• 10X database query performance
• 4X faster VM migration; more VMs per server and more bandwidth per VM
• 2X Hadoop performance
• 13X Memcached performance
• 4X price/performance
• Mellanox storage acceleration software provides >80% more IOPS (I/O operations per second)
Mellanox VPI Interconnect Solutions
 VPI adapters (adapter card, LOM, or mezzanine card form factors; PCIe 3.0)
• Ethernet: 10/40 Gb/s
• InfiniBand: 10/20/40/56 Gb/s
• Acceleration engines for networking, storage, clustering, and management applications
 VPI switches
• 64 ports 10GbE, 36 ports 40GbE, or 48 ports 10GbE + 12 ports 40GbE
• 36 InfiniBand ports at up to 56Gb/s
• Up to 8 VPI subnets
• Switch OS layer and Unified Fabric Manager
Leading Interconnect, Leading Performance
InfiniBand bandwidth and latency roadmap, 2001-2017, with the same software interface across generations:
Generation | Link Speed (4X) | Typical Latency
SDR | 10Gb/s | 5usec
DDR | 20Gb/s | 2.5usec
QDR | 40Gb/s | 1.3usec
FDR10 & FDR | 40/56Gb/s | 0.7usec
EDR | 100Gb/s | 0.5usec
NDR | 160/200Gb/s | <0.5usec
InfiniBand Trade Association (IBTA)
 Founded in 1999
 Actively markets and promotes InfiniBand from an industry perspective through public relations engagements, developer conferences, and workshops
 InfiniBand software is developed under the open-source OpenFabrics Alliance
http://www.openfabrics.org/index.html
 The InfiniBand standard is developed by the InfiniBand Trade Association (IBTA)
http://www.infinibandta.org/home
 Steering committee members: [shown as vendor logos on the original slide]
InfiniBand is a Switch Fabric Architecture
 An interconnect technology connecting CPUs and I/O
 Super-high performance
• High bandwidth (starting at 10Gb/s and up to 100Gb/s)
• Low latency: fast application response across the cluster, < 1µs end to end (Mellanox switches: ~170 nanoseconds per hop)
• Low CPU utilization with RDMA (Remote Direct Memory Access): unlike Ethernet, traffic bypasses the OS and the CPUs
 The first industry-standard high-speed interconnect!
InfiniBand is a Switch Fabric Architecture
 InfiniBand was originally designed for large-scale grids and clusters
 Increased application performance
 A single-port solution for all LAN, SAN, and application communication
 High-reliability cluster management (redundant Subnet Manager)
 Automatic configuration of cluster switches and ports, performed by the Subnet Manager software
The first industry-standard high-speed interconnect!
RDMA – How Does it Work?
[Diagram: RDMA over InfiniBand compared with TCP/IP. On the TCP/IP path, data crosses from the application buffer in user space through OS buffers in the kernel down to the NIC buffer on each host. With RDMA, the HCAs transfer data directly between the two applications' buffers, bypassing the OS and kernel on both sides.]
The verbs sketch below shows the building blocks that make this possible.
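A minimal libibverbs sketch (an illustration, assuming an installed HCA and libibverbs; compile with gcc rdma_mr_sketch.c -libverbs). It registers an application buffer with the HCA and shows, in a comment, how an RDMA write against a peer's registered buffer would be posted; QP creation and the out-of-band exchange of the peer's address and rkey are omitted.

/* rdma_mr_sketch.c — minimal RDMA building blocks via libibverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register an application buffer: the HCA can now DMA into/out of it
     * directly, which is what lets RDMA bypass the OS and CPU. */
    char *buf = calloc(1, 4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* The peer needs our buffer address and rkey to RDMA-write into buf. */
    printf("buffer=%p lkey=0x%x rkey=0x%x\n", (void *)buf, mr->lkey, mr->rkey);

    /* With a connected QP (creation/handshake omitted here), an RDMA write
     * is posted like this; the remote CPU is never involved:
     *
     *   struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = 4096,
     *                          .lkey = mr->lkey };
     *   struct ibv_send_wr wr = { .opcode = IBV_WR_RDMA_WRITE,
     *                             .sg_list = &sge, .num_sge = 1,
     *                             .send_flags = IBV_SEND_SIGNALED };
     *   wr.wr.rdma.remote_addr = peer_addr;   // from the peer
     *   wr.wr.rdma.rkey        = peer_rkey;   // from the peer
     *   struct ibv_send_wr *bad;
     *   ibv_post_send(qp, &wr, &bad);
     */

    ibv_dereg_mr(mr); free(buf);
    ibv_dealloc_pd(pd); ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}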
The InfiniBand Architecture
 An industry standard defined by the InfiniBand Trade Association
 Defines a System Area Network architecture
• Comprehensive specification: from the physical layer up to applications
 The architecture supports
• Host Channel Adapters (HCA)
• Switches
• Routers
[Diagram: an InfiniBand subnet of switches, managed by a Subnet Manager, connecting processor nodes (via HCAs), a storage subsystem and RAID, consoles, and Ethernet/Fibre Channel gateways.]
InfiniBand Components Overview
 Host Channel Adapter (HCA)
• A device that terminates an IB link, executes transport-level functions, and supports the verbs interface
 Switch
• A device that moves packets from one link to another within the same IB subnet
 Router
• A device that transports packets between different IBA subnets
 Bridge/Gateway
• InfiniBand to Ethernet
Host Channel Adapters (HCA)
 Equivalent to an Ethernet NIC
• Identified by a GUID (Globally Unique Identifier)
 Converts PCIe to InfiniBand
 CPU offload of transport operations
 End-to-end QoS and congestion control
 HCA bandwidth options (4X links):
• Single Data Rate (SDR): 2.5Gb/s * 4 = 10Gb/s
• Double Data Rate (DDR): 5Gb/s * 4 = 20Gb/s
• Quad Data Rate (QDR): 10Gb/s * 4 = 40Gb/s
• Fourteen Data Rate (FDR): 14Gb/s * 4 = 56Gb/s
• Enhanced Data Rate (EDR): 25Gb/s * 4 = 100Gb/s
Global Unique Identifier (GUID) – Physical Address
 Every InfiniBand node requires GUID and LID addresses
 GUID (Globally Unique Identifier): a 64-bit address, like an Ethernet MAC address
• Assigned by the IB vendor
• Persistent through reboots
 An IB switch carries multiple GUIDs
• Node GUID: identifies the HCA or switch as an entity
• Port GUID: identifies each individual port
• System GUID: allows combining multiple GUIDs into one logical entity
A sketch of reading GUIDs through the verbs API follows.
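A small sketch (assuming libibverbs; compile with gcc guid_list.c -libverbs) that prints the node GUID of every local HCA, illustrating that the GUID is vendor-assigned and stable:

/* guid_list.c — print the node GUID of each local HCA */
#include <stdio.h>
#include <endian.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);

    for (int i = 0; i < num; i++) {
        /* ibv_get_device_guid() returns the 64-bit GUID in network byte
         * order; it is burned in by the vendor and survives reboots. */
        unsigned long long guid = be64toh(ibv_get_device_guid(list[i]));
        printf("%-16s node GUID: 0x%016llx\n",
               ibv_get_device_name(list[i]), guid);
    }
    ibv_free_device_list(list);
    return 0;
}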
The IB Fabric Basic Building Block
 A single 36-port IB switch chip is the basic building block of every IB switch module
 A switching module with more ports is built from multiple such chips
 In this example, a 72-port switch is built from 6 identical chips:
• 4 chips function as lines
• 2 chips function as the core
IB Fabric L2 Switching Addressing: Local Identifier (LID)
 Local Identifier: a 16-bit L2 address
• Assigned by the Subnet Manager when the port becomes active
• Not persistent through reboots
 LID address ranges (see the classification sketch below):
• 0x0000 = Reserved
• 0x0001 – 0xBFFF = Unicast
• 0xC000 – 0xFFFE = Multicast
• 0xFFFF = Reserved for special use
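A tiny illustrative sketch (not part of any Mellanox tool) that classifies a LID according to the ranges above:

/* lid_class.c — classify a 16-bit InfiniBand LID */
#include <stdio.h>
#include <stdint.h>

static const char *lid_class(uint16_t lid)
{
    if (lid == 0x0000) return "reserved";
    if (lid <= 0xBFFF) return "unicast";
    if (lid <= 0xFFFE) return "multicast";
    return "reserved for special use";   /* 0xFFFF (permissive LID) */
}

int main(void)
{
    uint16_t samples[] = { 0x0000, 0x0001, 0xBFFF, 0xC000, 0xFFFE, 0xFFFF };
    for (int i = 0; i < 6; i++)
        printf("LID 0x%04X: %s\n", samples[i], lid_class(samples[i]));
    return 0;
}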
InfiniBand Network Segmentation – Partitions
 Define different partitions for different customers
 Define different partitions for different applications
 Allows fabric partitioning for security purposes
 Allows fabric partitioning for Quality of Service (QoS)
 Each partition has an identifier called a PKEY and can be assigned its own Service Level (see the query sketch below)
[Diagram: hosts grouped into partitions, e.g. PKEY ID 2 with Service Level 3, PKEY ID 3 with Service Level 3, and PKEY ID 3 with Service Level 1.]
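A sketch of reading the partition table the SM programmed into a port (assumptions: first device, port 1; compile with gcc pkey_dump.c -libverbs):

/* pkey_dump.c — list the PKEY table entries of port 1 on the first HCA */
#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    struct ibv_context *ctx = ibv_open_device(list[0]);

    struct ibv_device_attr dev_attr;
    ibv_query_device(ctx, &dev_attr);

    /* Every partition the SM configured shows up as a table entry;
     * 0xFFFF is the default full-membership partition. */
    for (int i = 0; i < dev_attr.max_pkeys; i++) {
        uint16_t pkey;
        if (ibv_query_pkey(ctx, 1, i, &pkey) == 0 && pkey != 0)
            printf("pkey[%d] = 0x%04x\n", i, pkey);
    }
    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}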
GID – Global Identifier
 Usage
• A 128-bit field in the Global Routing Header (GRH), used to route packets between different IB subnets
• Also serves as the port identifier for multicast groups (IB and IPoIB)
 Structure (IPv6-style address)
• Subnet prefix: up to 64 bits, uniquely identifying the set of end ports managed by a common Subnet Manager
• GUID: a 64-bit identifier provided by the manufacturer
 Example: port GUID 0x0002c90300455fd1 with the default subnet prefix yields
default GID: fe80:0000:0000:0000:0002:c903:0045:5fd1
A sketch of reading a port's GID via verbs follows.
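A sketch (assumptions: first device, port 1, GID index 0; compile with gcc gid_show.c -libverbs) that reads GID 0 and splits it into the subnet prefix and port GUID, mirroring the example above:

/* gid_show.c — read GID index 0 of port 1 and decompose it */
#include <stdio.h>
#include <endian.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    struct ibv_context *ctx = ibv_open_device(list[0]);

    union ibv_gid gid;
    ibv_query_gid(ctx, 1, 0, &gid);   /* GID 0 = subnet prefix + port GUID */

    printf("subnet prefix: 0x%016llx\n",
           (unsigned long long)be64toh(gid.global.subnet_prefix));
    printf("port GUID:     0x%016llx\n",
           (unsigned long long)be64toh(gid.global.interface_id));

    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}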
IB Basic Management Concepts
 Node: any managed entity (end node, switch, router)
 Manager: an active entity that sources commands and queries
• The Subnet Manager (SM)
 Agent: a (mostly) passive entity residing on every node; responds to Subnet Manager queries
 Management Datagram (MAD):
• The standard message format for manager-agent communication
• Carried in an Unreliable Datagram (UD)
[Diagram: one SM managing agents on every switch and host in the subnet.]
Objectives of Subnet Management
 Initialization and configuration of the subnet elements
 Establishing the best traffic paths from source to destination through the subnet
 Fault isolation
 Continuing these activities during topology changes
 Preventing unauthorized Subnet Managers
Node & Switch Main Identifiers
 IB port basic identifiers
• Host Channel Adapter – HCA (the IB "NIC")
• Port number
• Globally Unique Identifier – GUID, 64 bits (like a MAC address), e.g. 00:02:C9:02:00:41:38:30
- Each 36-port "basic" switch has its own switch and system GUID
- All ports belonging to the same "basic" switch share the switch GUID
• Local Identifier – LID
 LID
• A local identifier assigned to every IB device by the SM, used for packet switching within the IB fabric
• All ports of the same ASIC unit use the same LID
Node & Switch Main Identifiers
 Virtual Lanes (VLs)
• Each Virtual Lane uses its own buffers to send packets toward the other side
• VL 15 is used only for SM management traffic
• VLs 0-7 are used for data traffic
• Used to separate traffic classes, with distinct bandwidth and QoS, over the same physical port
[Diagram: the transmit side muxes per-VL buffers onto the physical link (traffic packets on VLs 0-7, link control on VL 15); the receive side de-muxes packets back into per-VL buffers.]
Subnet Manager Cluster Discovery
Subnet Manager & Fabric Configuration Process
1. Physical Subnet Establishment
2. Subnet Discovery
3. Information Gathering
4. LID Assignment
5. Path Establishment
6. Port Configuration
7. Switch Configuration
8. Subnet Activation
Subnet Manager (SM) Rules & Roles
 Every subnet must have at least one SM
- Manages all elements in the IB fabric
- Discovers the subnet topology
- Assigns LIDs to devices
- Calculates and programs the switch forwarding tables (LFT pathing)
- Monitors changes in the subnet
 Can be implemented anywhere in the fabric
- Node, switch, or specialized device
 No more than one active SM is allowed
- 1 active (master); the rest are standby (HA)
Fabric Discovery (A)
1. The SM wakes up and starts the fabric discovery process
2. The SM starts a "conversation" with every node over the InfiniBand link it is connected to. In this discovery stage, the SM collects:
• Switch information, followed by port information
• Host information
3. Any switch already discovered is used as a gate for the SM to further discover all of that switch's links and the switches it is connected to (its neighbors)
Fabric Discovery (B)
4. The SM gathers information by sending and receiving SMPs (Subnet Management Packets)
a. These special management packets are sent on Virtual Lane 15 (VL15)
• VL15 is a special, non-flow-controlled VL
b. Two primary types of SMP addressing build the cluster routing tables:
• Directed routing (DR): paths described by node GUIDs and port numbers
- This is the type primarily used by OpenSM
c. LID routing (LR)
• Once the SM has assigned a LID to each node, topology and packet routing tables are based on those LIDs
Fabric Information Gathering During Discovery
 Node info gathered
• Node type
• Number of ports
• GUID
• Partition table size
 Port info gathered
• Forwarding database size
• MTU
• Width
• VLs
Fabric Direct Route Information Gathering
 The SM builds a direct routing table from and to each of the fabric elements
 Each node in a path is identified by its port number and GUID
 The table content is saved in the SM's LMX table
[Table: example direct routes from Switch-1 to each fabric element (Switch-2 through Switch-6 and hosts H-11, H-16), expressed as sequences of switch/port hops, e.g. Switch-1 port 2 → Switch-2.]
LID Assignment
 After the SM has finished gathering all needed subnet information, it assigns a base LID and LMC to each of the attached end ports
• The LID is assigned at the port level rather than the device level
• Switch external ports do not get (or need) LIDs
 The DLID (destination LID) is used as the main address for InfiniBand packet switching
 Each switch port can be identified by the combination of LID and port number
[Diagram: an example fabric with switch LIDs 3, 21, 22, 23 and host LIDs 72, 75, 81, 82.]
A verbs sketch reading the SM-assigned LID follows.
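A sketch (assumptions: first device, port 1; compile with gcc port_lid.c -libverbs) that reads back the LID, LMC, and SM LID the Subnet Manager assigned to a port:

/* port_lid.c — read the SM-assigned addressing of port 1 on the first HCA */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    struct ibv_context *ctx = ibv_open_device(list[0]);

    struct ibv_port_attr port;
    ibv_query_port(ctx, 1, &port);

    /* base LID + LMC: the port answers to 2^lmc consecutive LIDs */
    printf("base LID 0x%04x, LMC %u, SM LID 0x%04x, state %d\n",
           port.lid, port.lmc, port.sm_lid, port.state);

    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}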
Linear Forwarding Table Establishment (Path Establishment)
 After the SM has finished gathering all fabric information, including the direct route tables, it assigns a LID to each of the nodes
 At this stage the LMX table is populated with the possible routes to each node
 For each destination LID, the LMX yields the best route as well as the alternative routes
 The best path is selected with a Shortest Path First (SPF) algorithm, producing a Linear Forwarding Table (LFT) per switch, e.g. for Switch_1:
Dest. LID | Best route / exit port
21 | 2
22 | 3
23 | 8
75 | 3
81 | 3
82 | 8
A small sketch of this min-hop computation follows.
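A toy illustration (not OpenSM code; topology, LIDs, and port numbering below are invented) of how a subnet manager could derive a switch's LFT: breadth-first search yields min-hop distances, and each destination's entry is the local port whose neighbor lies on a shortest path.

/* lft_spf.c — toy SPF/min-hop LFT construction for one switch */
#include <stdio.h>
#include <string.h>

#define NODES 6
#define PORTS 4
/* port_to[n][p]: node on the far side of port p+1 of node n (-1 = unused).
 * Toy topology: Switch_1(0) links to Switch_2(1) and Switch_3(2);
 * host LID 21(3) hangs off Switch_2; LIDs 22(4) and 23(5) off Switch_3. */
static const int port_to[NODES][PORTS] = {
    { 1,  2, -1, -1 },   /* Switch_1 */
    { 0,  3, -1, -1 },   /* Switch_2 */
    { 0,  4,  5, -1 },   /* Switch_3 */
    { 1, -1, -1, -1 },   /* host LID 21 */
    { 2, -1, -1, -1 },   /* host LID 22 */
    { 2, -1, -1, -1 },   /* host LID 23 */
};
static const int lid[NODES] = { 1, 2, 3, 21, 22, 23 };

/* BFS hop counts from node src to every node */
static void bfs(int src, int dist[NODES])
{
    int queue[NODES], head = 0, tail = 0;
    memset(dist, -1, NODES * sizeof(int));
    dist[src] = 0; queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        for (int p = 0; p < PORTS; p++) {
            int v = port_to[u][p];
            if (v >= 0 && dist[v] < 0) { dist[v] = dist[u] + 1; queue[tail++] = v; }
        }
    }
}

int main(void)
{
    int sw = 0, dist[NODES];
    printf("LFT for Switch_1 (DLID -> exit port):\n");
    for (int dst = 0; dst < NODES; dst++) {
        if (dst == sw) continue;
        bfs(dst, dist);                      /* min hops from dst to all */
        int best_port = -1, best = 1 << 30;
        for (int p = 0; p < PORTS; p++) {    /* pick neighbor closest to dst */
            int v = port_to[sw][p];
            if (v >= 0 && dist[v] + 1 < best) { best = dist[v] + 1; best_port = p + 1; }
        }
        printf("  LID %2d -> port %d (%d hops)\n", lid[dst], best_port, best);
    }
    return 0;
}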
LID Routed (LR) Forwarding
 Uses the LFT tables
 Based on the data gathered into the LMX during direct-route discovery
 This is the standard routing of packets used by switches
 Uses regular link-level headers to define the destination and other information, such as:
• DLID = LID of the final destination
• SL = Service Level of the path
• Each switch uses its forwarding table and SL-to-VL table to decide on the packet's output port/VL
(LFT of Switch_1 as on the previous slide: 21→2, 22→3, 23→8, 75→3, 81→3, 82→8)
LID Routed (LR) Forwarding
 LRH: Local Routing Header, carrying:
• Source & destination LIDs
• Service Level (SL)
• Virtual Lane (VL)
• Packet length
InfiniBand data packet format:
LRH (8B) | GRH (40B) | BTH (12B) | Ext HDRs (var) | Payload (0…4096B) | ICRC (4B) | VCRC (2B)
A forwarding-decision sketch follows.
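A sketch of the switch's forwarding decision: the DLID from the LRH indexes the LFT to pick the exit port, and the SL indexes the SL-to-VL table to pick the output VL. Table contents are invented to match the Switch_1 example above; real switches key SL-to-VL per input/output port pair.

/* fwd_lookup.c — DLID/SL lookup as a switch would perform it */
#include <stdio.h>
#include <stdint.h>

#define LFT_SIZE 128
static uint8_t lft[LFT_SIZE];   /* DLID -> exit port */
static uint8_t sl2vl[16];       /* SL   -> VL (0-7)  */

int main(void)
{
    /* entries the SM would have programmed */
    lft[21] = 2; lft[22] = 3; lft[23] = 8;
    lft[75] = 3; lft[81] = 3; lft[82] = 8;
    for (int sl = 0; sl < 16; sl++) sl2vl[sl] = sl % 8;

    uint16_t dlid = 75;         /* from the packet's LRH */
    uint8_t  sl   = 3;
    printf("DLID %u, SL %u -> exit port %u, VL %u\n",
           dlid, sl, lft[dlid], sl2vl[sl]);
    return 0;
}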
Tracking Fabric Status – SM Sweeps
 Light sweep:
• A routine sweep by the Subnet Manager
• By default, runs every 30 seconds
• Queries all switches for their switch and port information
 A light sweep detects:
• Port status changes
• A new SM speaking on the subnet
• A Subnet Manager priority change
Tracking Fabric Status – SM Sweeps
 Any change detected by the light sweep causes a heavy sweep
 IB trap
• A switch status change raises an immediate IB trap, which is sent to the Subnet Manager and triggers a heavy sweep
 Heavy sweep
• Causes the entire SM fabric discovery to be performed from scratch
InfiniBand Fabric Topologies
InfiniBand Fabric Commonly Used Topologies
• Back to back
• 2-tier Fat Tree (modular switches are based on a Fat Tree architecture)
• 3D Torus
• Dual star
• Hybrid
The IB Fabric Basic Building Block
 A single 36-port IB switch chip is the basic building block of every IB switch module
 A switching module with more ports is built from multiple such chips
 In this example, a 72-port switch is built from 6 identical chips:
• 4 chips function as lines
• 2 chips function as the core
CLOS Topology
 A pyramid-shaped topology
 The switches at the top of the pyramid are called spines (or core switches)
• The core/spine switches are interconnected only to the other switch tiers
 The switches at the bottom of the pyramid are called leafs (also lines or edges)
• The leaf/line/edge switches connect to the fabric nodes/hosts
 In a non-blocking CLOS fabric, the number of external connections equals the number of internal connections
[Diagram: spine/core layer on top with internal connections down to the leaf/line/edge layer, which carries the external connections to hosts.]
CLOS Topology
 External connections:
• The connections between the hosts and the line switches
 Internal connections:
• The connections between the core switches and the line switches
 A non-blocking fabric always has balanced cross-bisectional bandwidth
 If the number of external connections exceeds the number of internal connections, the configuration is blocking
CLOS-3
 The topology detailed here is called CLOS-3
 The maximum traffic path between any source and destination is 3 hops (3 switches)
 Example, a session from A to B:
• One hop from A to switch L1-1
• Next hop from switch L1-1 to switch L2-1
• Last hop from L2-1 to L1-4
CLOS-3 (continued)
 This example shows a 108-port non-blocking fabric:
• 108 hosts are connected to the line switches (6 line switches * 18 ports = 108 external ports)
• 108 links connect the line switches to the core switches, enabling non-blocking interconnection of the line switches
IB Fabric Protocol Layers
IB Switch – L2
[Diagram: the InfiniBand layer stack across two end nodes and an L2 switch. Upper-layer protocols exchange client transactions; the transport layer exchanges messages over queue pairs (IBA operations, SAR); the network layer handles inter-subnet routing; the link layer performs LID-based L2 switching, with the switch acting as a packet relay; the physical layer handles link encoding and media access control.]
IB Architecture Layers
 Software transport verbs and upper-layer protocols:
- The interface between application programs and the hardware
- Allows support of legacy protocols such as TCP/IP
- Defines the methodology for management functions
 Transport:
- Delivers packets to the appropriate queue pair; message assembly/disassembly, access rights, etc.
 Network:
- How packets are routed between different partitions/subnets
 Data link (symbols and framing):
- From source to destination on the same subnet; credit-based flow control; how packets are switched
 Physical:
- Signal levels and frequency, media, connectors
[Diagram: the same layer stack shown on the previous slide.]
InfiniBand Header Structure
[Diagram: an original message at the upper-layer protocol is segmented (SAR) by the transport layer into packets that fit the MTU; the network layer addresses them by subnet prefix + GUID; the link layer performs LID-based L2 switching, with the switch relaying packets; the physical layer carries the bits. The receiving side reverses the process and reassembles the original message.]
Transport Layer – Responsibilities
 The network and link protocols deliver a packet to the desired destination
 The transport layer:
• Segments message data payloads coming from the upper layer into multiple packets that fit a valid MTU size
• Delivers each packet to the proper queue pair (QP, assigned to a specific session)
• Instructs the QP how to process the packet's data (Work Queue Element, WQE)
• Reassembles the packets arriving from the other side into messages
(see the segmentation sketch below)
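A toy illustration of transport-layer segmentation (SAR). Real HCAs do this in hardware; the message size and MTU below are invented for the example:

/* sar_sketch.c — split a message into MTU-sized packets */
#include <stdio.h>

int main(void)
{
    int msg_len = 10000;   /* bytes in the upper-layer message */
    int mtu     = 4096;    /* active path MTU (valid IB MTUs: 256..4096) */

    int packets = (msg_len + mtu - 1) / mtu;   /* ceiling division */
    printf("%d-byte message over MTU %d -> %d packets\n",
           msg_len, mtu, packets);
    for (int i = 0; i < packets; i++) {
        int off = i * mtu;
        int len = (msg_len - off < mtu) ? msg_len - off : mtu;
        printf("  packet %d: bytes %d..%d (%d bytes)\n",
               i, off, off + len - 1, len);
    }
    return 0;
}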
Transport Layer – Responsibilities
[Diagram: the same segmentation-and-reassembly flow shown under "InfiniBand Header Structure" above, repeated for reference.]
Layer 2 Forwarding
 Switches use an FDB (Forwarding Database)
• Based on the DLID and SL, a packet is sent to the correct output port and specific VL
[Diagram: an inbound packet's DLID and SL are looked up in the FDB (DLID to port) and the SL-to-VL table to select the output port and VL for the outbound packet.]
Link Layer – Flow Control
[Diagram: per-VL receive buffers; the receiver returns credits in link control packets, and the transmitter's arbitration logic muxes per-VL packets onto the link.]
 Credit-based link-level flow control
• Link flow control assures NO packet loss within the fabric, even in the presence of congestion
• Link receivers grant packet receive buffer space credits per Virtual Lane
• Flow control credits are issued in 64-byte units (see the worked example below)
 Separate flow control per Virtual Lane provides:
• Alleviation of head-of-line blocking
• Virtual fabrics: congestion and latency on one VL do not impact traffic with guaranteed QoS on another VL, even though they share the same physical link
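A worked example of the credit arithmetic (the per-VL buffer size is an assumption; the 64-byte credit unit is from the slide above):

/* fc_credits.c — IB link-level credits in 64-byte units */
#include <stdio.h>

int main(void)
{
    int vl_buffer_bytes = 8192;            /* per-VL receive buffer (example) */
    int credits = vl_buffer_bytes / 64;    /* credits advertised for this VL  */
    int pkt_bytes = 8 + 12 + 4096 + 4 + 2; /* LRH + BTH + max payload + ICRC + VCRC */
    printf("%d credits advertised; a %d-byte packet consumes %d credits\n",
           credits, pkt_bytes, (pkt_bytes + 63) / 64);
    return 0;
}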
Physical Layer – Responsibilities
 InfiniBand is a lossless fabric
 The maximum Bit Error Rate (BER) allowed by the IB spec is 10^-12; statistically, Mellanox fabrics provide around 10^-15
 The physical layer must guarantee effective signaling to meet this BER requirement
Physical Layer (continued)
 Industry-standard media types
• Copper: 7 meters at QDR, 3 meters at FDR
• Fiber: 100/300 meters at QDR and FDR
 64/66-bit encoding on FDR links
• Encoding makes it possible to send high-speed digital signals over longer distances and improves bandwidth efficiency
• For every Y signal bits on the line, X are actual data bits
• 64/66 * 56Gb/s ≈ 54.6Gb/s effective
 8/10-bit encoding (DDR and QDR)
• X/Y line efficiency (example: 80% * 40Gb/s = 32Gb/s effective)
 Connectors: 4X QSFP for both fiber and copper (an effective-rate calculation follows)
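The encoding arithmetic above, worked through in code (assuming the standard signaling rates: QDR at 10Gb/s per lane, FDR at 14.0625Gb/s per lane, i.e. 56.25Gb/s on a 4X link):

/* eff_rate.c — effective data rate after line encoding */
#include <stdio.h>

int main(void)
{
    printf("QDR: 40 * 8/10     = %.1f Gb/s effective\n", 40.0 * 8 / 10);
    printf("FDR: 56.25 * 64/66 = %.1f Gb/s effective\n", 56.25 * 64 / 66);
    return 0;
}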
Mellanox Cables – Perceptions vs. Facts
 Perception: Mellanox cables are rebranded from a cable vendor
• Fact: Mellanox cables are manufactured by Mellanox
 Perception: the cable vendor can sell the same cables
• Fact: no other vendor is allowed to sell Mellanox cables; Mellanox cables use a different assembly procedure and are tested with a unique test suite, and vendors' "finished goods" fail Mellanox dedicated testing
 Mellanox allows customers to use any IBTA-approved IB cables
 Cable types: passive copper (SFP+), active optical, active copper
Mellanox Passive Copper Cables
 Superior design and qualification process
 Committed to a Bit Error Rate (BER) better than 10^-15
 Longest reach with a Mellanox end-to-end solution
Data Rate | PCC Max Reach
FDR | 3 meters
FDR10 | 5 meters
QDR | 7 meters
40GbE | 7 meters
10GbE | 7 meters
Mellanox Active Fiber Cables
 Superior design and qualification process
 Committed to a Bit Error Rate (BER) better than 10^-15
 Longest reach with a Mellanox end-to-end solution
 Optical Performance Optimization (patent pending)
Data Rate | Max Reach
FDR | 300 meters
FDR10 | 100 meters
QDR | 300 meters
40GbE | 100 meters
Mellanox Family Products
Leading Supplier of End-to-End Interconnect Solutions
[Diagram: the end-to-end InfiniBand and Ethernet portfolio slide shown at the start of the deck, repeated for reference.]
InfiniBand Switches & Gateways
FDR InfiniBand Switch Portfolio
 Modular switches: 648, 324, 216, and 108 ports
 Edge switches:
• SX6036 – 36 ports, managed
• SX6025 – 36 ports, externally managed
• SX6018 – 18 ports, managed
• SX6015 – 18 ports, externally managed
• SX6012 – 12 ports, managed
• SX6005 – 12 ports, externally managed
 Long-distance bridge – VPI (bridging, routing)
SwitchX® VPI Technology Highlights
Virtual Protocol Interconnect® (VPI): one switch, multiple technologies
• VPI on box: the same box runs InfiniBand OR Ethernet
• VPI per port: the same box runs InfiniBand AND Ethernet
• VPI bridging: the same box bridges InfiniBand AND Ethernet (bridging, routing)
MetroX™ - Mellanox Long-Haul Solutions
 Provides InfiniBand and Ethernet long-haul solutions of up to 80km for campus and metro applications
 Connects data centers deployed across multiple geographically distributed sites
 Extends InfiniBand RDMA and Ethernet RoCE beyond local data centers and storage clusters
 A cost-effective, low-power, easily managed, and scalable solution
 Managed as a single, unified network infrastructure
MetroDX™ and MetroX™ Features
Feature | TX6000 | TX6100 | TX6240 | TX6280
Distance | 1km | 10km | 40km | 80km
Throughput | 640Gb/s | 240Gb/s | 80Gb/s | 40Gb/s
Port Density | 16 x FDR10 long haul, 16 x FDR downlink | 6 x 40Gb/s long haul, 6 x 56Gb/s downlink | 2 x 10/40Gb/s long haul, 2 x 56Gb/s downlink | 1 x 10/40Gb/s long haul, 1 x 56Gb/s downlink
Latency | 200ns + 5us/km over fiber | 200ns + 5us/km over fiber | 700ns + 5us/km over fiber | 700ns + 5us/km over fiber
Power | ~200W | ~200W | ~280W | ~280W
QoS | One data VL + VL15 | One data VL + VL15 | One data VL + VL15 | One data VL + VL15
Space | 1RU | 1RU | 2RU | 2RU
Mellanox Host Channel Adapters (HCA)
Reference document: ConnectX®-3 VPI Single and Dual QSFP Port Adapter Card User Manual
http://www.mellanox.com/page/products_dyn?product_family=119&mtag=connectx_3_vpi
HCA ConnectX-3 InfiniBand Main Features
 Up to 56Gb/s InfiniBand or 40 Gigabit Ethernet per port
 PCI Express 3.0 (up to 8GT/s)
 CPU offload of transport operations
 Application offload
 GPU communication acceleration
 End-to-end QoS and congestion control
 Hardware-based I/O virtualization
 Dynamic power management
 Fibre Channel encapsulation (FCoIB or FCoE)
 Ethernet encapsulation (EoIB)
Target markets: HPC, database, cloud computing, virtualization
Adapter Offering
 ConnectX-3 VPI
• Up to 56Gb/s InfiniBand, up to 56GbE
• RDMA, CPU offload, SR-IOV
 ConnectX-3 Pro
• Adds NVGRE and VXLAN hardware offload
• RoCE v2 (UDP), ECN/QCN
 Connect-IB
• Up to 56Gb/s InfiniBand, greater than 100Gb/s bidirectional
• PCIe x16
• Dynamically Connected transport (DC), T10-DIF
• More than 130M messages/sec
Fabric Management
Goal of Fabric Utilities in an HPC Context
 Enable fast cluster bring-up
• Point out issues with devices, systems, and cables
• Provide an inventory including cables, devices, FW, SW
• Perform device-specific (proprietary) checks
• Eye-opening and BER checks
• Catch cabling mistakes
 Validate the Subnet Manager's work
• Verify connectivity at the lowest level possible
• Report the subnet configuration
• SM-agnostic
Goal of Fabric Utilities in an HPC Context (continued)
 Diagnose L2 communication failures
• At the entire subnet level
• On a point-to-point path
 Monitor network health
• Continuously and with low overhead
Fabric Management Solution Overview
 ibutils: ibdiagnet/ibdiagpath
• An automated L2 health analysis procedure
• Text interface
• No dedicated "monitoring" mode
• Significant development over the past year on features and runtime performance at scale
 UFM
• High-end monitoring and provisioning capabilities
• GUI-based, with CLI options
• Includes the ibutils capabilities with additional features
• Central device management
- Fabric dashboard
- Congestion analysis
• System integration capabilities
- SNMP traps and alarms
UFM in the Fabric
 Software or appliance form factor
 High availability with 2 or more UFM servers (synchronization, heartbeat)
 Switch and HCA management
 Full management or monitoring-only modes
Multi-Site Management
[Diagram: UFM managing multiple sites – Site 1, Site 2, … Site n.]
Integration with 3rd Party Systems
 Extensible architecture
• Based on web services
 Open API for users or 3rd-party extensions
• Allows simple reporting, provisioning, and monitoring
• Task automation
• Software Development Kit
 Extensible object model
• User-defined fields
• User-defined menus
[Diagram: monitoring systems, configuration management, orchestrators, and job schedulers read/write/manage UFM through a web-based API; alerts are delivered via SNMP traps.]
UFM – Comprehensive, Robust Management
• Automatic discovery
• Central device management
• Fabric dashboard
• Congestion analysis
• Health & performance monitoring
• Service-oriented provisioning
• Fabric health reports
UFM Main Features
• Automatic discovery, central device management, fabric dashboard, congestion analysis
• Health & performance monitoring, advanced alerting, fabric health reports, service-oriented provisioning
Advanced Monitoring and Analysis
 Monitor & analyze fabric performance
• Bandwidth utilization
• Unique congestion monitoring
• Dashboard for an aggregated fabric view
 Real-time, fabric-wide health monitoring
• Monitor events and errors throughout the fabric
• Threshold-based alarms
• Granular monitoring of host and switch parameters
 Innovative congestion mapping
• One view for fabric-wide congestion and traffic patterns
• Enables root-cause analysis of routing, job placement, or resource allocation inefficiencies
 Everything is managed at the job/aggregation level

More Related Content

What's hot

BKK16-205 RDK-B IoT
BKK16-205 RDK-B IoTBKK16-205 RDK-B IoT
BKK16-205 RDK-B IoTLinaro
 
Introduction to SDN and Network Programmability - BRKRST-1014 | 2017/Las Vegas
Introduction to SDN and Network Programmability - BRKRST-1014 | 2017/Las VegasIntroduction to SDN and Network Programmability - BRKRST-1014 | 2017/Las Vegas
Introduction to SDN and Network Programmability - BRKRST-1014 | 2017/Las VegasBruno Teixeira
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringScyllaDB
 
Introduction to OFI
Introduction to OFIIntroduction to OFI
Introduction to OFIseanhefty
 
InfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and RoadmapInfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and Roadmapinside-BigData.com
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingMichelle Holley
 
Chips alliance omni xtend overview
Chips alliance omni xtend overviewChips alliance omni xtend overview
Chips alliance omni xtend overviewRISC-V International
 
Dataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and toolsDataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and toolsStefano Salsano
 
Cilium - Container Networking with BPF & XDP
Cilium - Container Networking with BPF & XDPCilium - Container Networking with BPF & XDP
Cilium - Container Networking with BPF & XDPThomas Graf
 
OFI libfabric Tutorial
OFI libfabric TutorialOFI libfabric Tutorial
OFI libfabric Tutorialdgoodell
 
Network Fundamentals: Ch3 - Application Layer Functionality and Protocols
Network Fundamentals: Ch3 - Application Layer Functionality and ProtocolsNetwork Fundamentals: Ch3 - Application Layer Functionality and Protocols
Network Fundamentals: Ch3 - Application Layer Functionality and ProtocolsAbdelkhalik Mosa
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Andriy Berestovskyy
 
Integration and Interoperation of existing Nexus networks into an ACI Archite...
Integration and Interoperation of existing Nexus networks into an ACI Archite...Integration and Interoperation of existing Nexus networks into an ACI Archite...
Integration and Interoperation of existing Nexus networks into an ACI Archite...Cisco Canada
 

What's hot (20)

InfiniBand
InfiniBandInfiniBand
InfiniBand
 
BKK16-205 RDK-B IoT
BKK16-205 RDK-B IoTBKK16-205 RDK-B IoT
BKK16-205 RDK-B IoT
 
Introduction to SDN and Network Programmability - BRKRST-1014 | 2017/Las Vegas
Introduction to SDN and Network Programmability - BRKRST-1014 | 2017/Las VegasIntroduction to SDN and Network Programmability - BRKRST-1014 | 2017/Las Vegas
Introduction to SDN and Network Programmability - BRKRST-1014 | 2017/Las Vegas
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
Ospf.ppt
Ospf.pptOspf.ppt
Ospf.ppt
 
Introduction to OFI
Introduction to OFIIntroduction to OFI
Introduction to OFI
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
InfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and RoadmapInfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and Roadmap
 
An introduction to MQTT
An introduction to MQTTAn introduction to MQTT
An introduction to MQTT
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet Processing
 
Chips alliance omni xtend overview
Chips alliance omni xtend overviewChips alliance omni xtend overview
Chips alliance omni xtend overview
 
Basic Linux Internals
Basic Linux InternalsBasic Linux Internals
Basic Linux Internals
 
Dataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and toolsDataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and tools
 
Cilium - Container Networking with BPF & XDP
Cilium - Container Networking with BPF & XDPCilium - Container Networking with BPF & XDP
Cilium - Container Networking with BPF & XDP
 
OFI libfabric Tutorial
OFI libfabric TutorialOFI libfabric Tutorial
OFI libfabric Tutorial
 
PCI express
PCI expressPCI express
PCI express
 
Network Fundamentals: Ch3 - Application Layer Functionality and Protocols
Network Fundamentals: Ch3 - Application Layer Functionality and ProtocolsNetwork Fundamentals: Ch3 - Application Layer Functionality and Protocols
Network Fundamentals: Ch3 - Application Layer Functionality and Protocols
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
Integration and Interoperation of existing Nexus networks into an ACI Archite...
Integration and Interoperation of existing Nexus networks into an ACI Archite...Integration and Interoperation of existing Nexus networks into an ACI Archite...
Integration and Interoperation of existing Nexus networks into an ACI Archite...
 
MPLS & BASIC LDP
MPLS & BASIC LDPMPLS & BASIC LDP
MPLS & BASIC LDP
 

Similar to InfiniBand Essentials Every HPC Expert Must Know

How to Implement SDN Technology in ITB
How to Implement SDN Technology in ITBHow to Implement SDN Technology in ITB
How to Implement SDN Technology in ITBSDNRG ITB
 
RINA essentials, PISA Internet Festival 2015
RINA essentials, PISA Internet Festival 2015RINA essentials, PISA Internet Festival 2015
RINA essentials, PISA Internet Festival 2015ICT PRISTINE
 
IBM System Networking Overview - Jul 2013
IBM System Networking Overview - Jul 2013IBM System Networking Overview - Jul 2013
IBM System Networking Overview - Jul 2013Angel Villar Garea
 
Vsat day-2008-idirect
Vsat day-2008-idirectVsat day-2008-idirect
Vsat day-2008-idirectSSPI Brasil
 
CCNA PPT
CCNA PPTCCNA PPT
CCNA PPTAIRTEL
 
Design and Deployment of Enterprise WLANs
Design and Deployment of Enterprise WLANsDesign and Deployment of Enterprise WLANs
Design and Deployment of Enterprise WLANsFab Fusaro
 
2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANLdgoodell
 
Ip interfaces by faststream technologies
Ip interfaces by faststream technologiesIp interfaces by faststream technologies
Ip interfaces by faststream technologiesVishalMalhotra58
 
ch5-Fog Networks and Cloud Computing
ch5-Fog Networks and Cloud Computingch5-Fog Networks and Cloud Computing
ch5-Fog Networks and Cloud Computingssuser06ea42
 
Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Eth...
Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Eth...Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Eth...
Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Eth...Mason Mei
 
IRATI Experimentation, US-EU FIRE Workshop
IRATI Experimentation, US-EU FIRE WorkshopIRATI Experimentation, US-EU FIRE Workshop
IRATI Experimentation, US-EU FIRE WorkshopEleni Trouva
 
Tech Tutorial by Vikram Dham: Let's build MPLS router using SDN
Tech Tutorial by Vikram Dham: Let's build MPLS router using SDNTech Tutorial by Vikram Dham: Let's build MPLS router using SDN
Tech Tutorial by Vikram Dham: Let's build MPLS router using SDNnvirters
 
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...Ceph Community
 
Introduction to Ethernet para radio enlace
Introduction to Ethernet para radio enlaceIntroduction to Ethernet para radio enlace
Introduction to Ethernet para radio enlacejonatanmedeirosgomes1
 

Similar to InfiniBand Essentials Every HPC Expert Must Know (20)

Seminar on Infiniband
Seminar on Infiniband Seminar on Infiniband
Seminar on Infiniband
 
Infini band presentation_shekhar
Infini band presentation_shekharInfini band presentation_shekhar
Infini band presentation_shekhar
 
To Infiniband and Beyond
To Infiniband and BeyondTo Infiniband and Beyond
To Infiniband and Beyond
 
Mellanox Approach to NFV & SDN
Mellanox Approach to NFV & SDNMellanox Approach to NFV & SDN
Mellanox Approach to NFV & SDN
 
Infini band
Infini bandInfini band
Infini band
 
How to Implement SDN Technology in ITB
How to Implement SDN Technology in ITBHow to Implement SDN Technology in ITB
How to Implement SDN Technology in ITB
 
RINA essentials, PISA Internet Festival 2015
RINA essentials, PISA Internet Festival 2015RINA essentials, PISA Internet Festival 2015
RINA essentials, PISA Internet Festival 2015
 
IBM System Networking Overview - Jul 2013
IBM System Networking Overview - Jul 2013IBM System Networking Overview - Jul 2013
IBM System Networking Overview - Jul 2013
 
Vsat day-2008-idirect
Vsat day-2008-idirectVsat day-2008-idirect
Vsat day-2008-idirect
 
CCNA PPT
CCNA PPTCCNA PPT
CCNA PPT
 
Design and Deployment of Enterprise WLANs
Design and Deployment of Enterprise WLANsDesign and Deployment of Enterprise WLANs
Design and Deployment of Enterprise WLANs
 
2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL2014/09/02 Cisco UCS HPC @ ANL
2014/09/02 Cisco UCS HPC @ ANL
 
Ip interfaces by faststream technologies
Ip interfaces by faststream technologiesIp interfaces by faststream technologies
Ip interfaces by faststream technologies
 
ch5-Fog Networks and Cloud Computing
ch5-Fog Networks and Cloud Computingch5-Fog Networks and Cloud Computing
ch5-Fog Networks and Cloud Computing
 
Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Eth...
Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Eth...Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Eth...
Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Eth...
 
IRATI Experimentation, US-EU FIRE Workshop
IRATI Experimentation, US-EU FIRE WorkshopIRATI Experimentation, US-EU FIRE Workshop
IRATI Experimentation, US-EU FIRE Workshop
 
Tech Tutorial by Vikram Dham: Let's build MPLS router using SDN
Tech Tutorial by Vikram Dham: Let's build MPLS router using SDNTech Tutorial by Vikram Dham: Let's build MPLS router using SDN
Tech Tutorial by Vikram Dham: Let's build MPLS router using SDN
 
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...
Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromisi...
 
Network Concepts
Network ConceptsNetwork Concepts
Network Concepts
 
Introduction to Ethernet para radio enlace
Introduction to Ethernet para radio enlaceIntroduction to Ethernet para radio enlace
Introduction to Ethernet para radio enlace
 

More from Mellanox Technologies

InfiniBand Growth Trends - TOP500 (July 2015)
InfiniBand Growth Trends - TOP500 (July 2015)InfiniBand Growth Trends - TOP500 (July 2015)
InfiniBand Growth Trends - TOP500 (July 2015)Mellanox Technologies
 
Ahead of the NFV Curve with Truly Scale-out Network Function Cloudification
Ahead of the NFV Curve with Truly Scale-out Network Function CloudificationAhead of the NFV Curve with Truly Scale-out Network Function Cloudification
Ahead of the NFV Curve with Truly Scale-out Network Function CloudificationMellanox Technologies
 
InfiniBand Strengthens Leadership as the Interconnect Of Choice
InfiniBand Strengthens Leadership as the Interconnect Of ChoiceInfiniBand Strengthens Leadership as the Interconnect Of Choice
InfiniBand Strengthens Leadership as the Interconnect Of ChoiceMellanox Technologies
 
CloudX – Expand Your Cloud into the Future
CloudX – Expand Your Cloud into the FutureCloudX – Expand Your Cloud into the Future
CloudX – Expand Your Cloud into the FutureMellanox Technologies
 
Interop Tokyo 2014 -- Mellanox Demonstrations
Interop Tokyo 2014 -- Mellanox DemonstrationsInterop Tokyo 2014 -- Mellanox Demonstrations
Interop Tokyo 2014 -- Mellanox DemonstrationsMellanox Technologies
 
Mellanox High Performance Networks for Ceph
Mellanox High Performance Networks for CephMellanox High Performance Networks for Ceph
Mellanox High Performance Networks for CephMellanox Technologies
 
Interconnect Your Future With Mellanox
Interconnect Your Future With MellanoxInterconnect Your Future With Mellanox
Interconnect Your Future With MellanoxMellanox Technologies
 
Interconnect Your Future with Connect-IB
Interconnect Your Future with Connect-IBInterconnect Your Future with Connect-IB
Interconnect Your Future with Connect-IBMellanox Technologies
 
MetroX™ – Mellanox Long Haul Solutions
MetroX™ – Mellanox Long Haul SolutionsMetroX™ – Mellanox Long Haul Solutions
MetroX™ – Mellanox Long Haul SolutionsMellanox Technologies
 
Unified Fabric Manager - HP Insight CMU Connector
Unified Fabric Manager - HP Insight CMU ConnectorUnified Fabric Manager - HP Insight CMU Connector
Unified Fabric Manager - HP Insight CMU ConnectorMellanox Technologies
 
Storage, Cloud, Web 2.0, Big Data Driving Growth
Storage, Cloud, Web 2.0, Big Data Driving GrowthStorage, Cloud, Web 2.0, Big Data Driving Growth
Storage, Cloud, Web 2.0, Big Data Driving GrowthMellanox Technologies
 

More from Mellanox Technologies (20)

InfiniBand Growth Trends - TOP500 (July 2015)
InfiniBand Growth Trends - TOP500 (July 2015)InfiniBand Growth Trends - TOP500 (July 2015)
InfiniBand Growth Trends - TOP500 (July 2015)
 
Ahead of the NFV Curve with Truly Scale-out Network Function Cloudification
Ahead of the NFV Curve with Truly Scale-out Network Function CloudificationAhead of the NFV Curve with Truly Scale-out Network Function Cloudification
Ahead of the NFV Curve with Truly Scale-out Network Function Cloudification
 
InfiniBand Strengthens Leadership as the Interconnect Of Choice
InfiniBand Strengthens Leadership as the Interconnect Of ChoiceInfiniBand Strengthens Leadership as the Interconnect Of Choice
InfiniBand Strengthens Leadership as the Interconnect Of Choice
 
CloudX – Expand Your Cloud into the Future
CloudX – Expand Your Cloud into the FutureCloudX – Expand Your Cloud into the Future
CloudX – Expand Your Cloud into the Future
 
Mellanox VXLAN Acceleration
Mellanox VXLAN AccelerationMellanox VXLAN Acceleration
Mellanox VXLAN Acceleration
 
Virtualization Acceleration
Virtualization Acceleration Virtualization Acceleration
Virtualization Acceleration
 
Interop Tokyo 2014 -- Mellanox Demonstrations
Interop Tokyo 2014 -- Mellanox DemonstrationsInterop Tokyo 2014 -- Mellanox Demonstrations
Interop Tokyo 2014 -- Mellanox Demonstrations
 
Mellanox High Performance Networks for Ceph
Mellanox High Performance Networks for CephMellanox High Performance Networks for Ceph
Mellanox High Performance Networks for Ceph
 
Interconnect Your Future With Mellanox
Interconnect Your Future With MellanoxInterconnect Your Future With Mellanox
Interconnect Your Future With Mellanox
 
Become a Supercomputer Hero
Become a Supercomputer HeroBecome a Supercomputer Hero
Become a Supercomputer Hero
 
Interconnect Product Portfolio
Interconnect Product PortfolioInterconnect Product Portfolio
Interconnect Product Portfolio
 
The Generation of Open Ethernet
The Generation of Open Ethernet The Generation of Open Ethernet
The Generation of Open Ethernet
 
Interconnect Your Future with Connect-IB
Interconnect Your Future with Connect-IBInterconnect Your Future with Connect-IB
Interconnect Your Future with Connect-IB
 
MetroX™ – Mellanox Long Haul Solutions
MetroX™ – Mellanox Long Haul SolutionsMetroX™ – Mellanox Long Haul Solutions
MetroX™ – Mellanox Long Haul Solutions
 
Unified Fabric Manager - HP Insight CMU Connector
Unified Fabric Manager - HP Insight CMU ConnectorUnified Fabric Manager - HP Insight CMU Connector
Unified Fabric Manager - HP Insight CMU Connector
 
Print 'N Fly - SC13
Print 'N Fly - SC13Print 'N Fly - SC13
Print 'N Fly - SC13
 
Mellanox 2013 Analyst Day
Mellanox 2013 Analyst DayMellanox 2013 Analyst Day
Mellanox 2013 Analyst Day
 
Interconnect Your Future
Interconnect Your FutureInterconnect Your Future
Interconnect Your Future
 
Mellanox's Technological Advantage
Mellanox's Technological AdvantageMellanox's Technological Advantage
Mellanox's Technological Advantage
 
Storage, Cloud, Web 2.0, Big Data Driving Growth
Storage, Cloud, Web 2.0, Big Data Driving GrowthStorage, Cloud, Web 2.0, Big Data Driving Growth
Storage, Cloud, Web 2.0, Big Data Driving Growth
 

Recently uploaded

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 

Recently uploaded (20)

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 

InfiniBand Essentials Every HPC Expert Must Know

  • 9. InfiniBand is a Switch Fabric Architecture
 InfiniBand was originally designed for large-scale grids and clusters
 Increased application performance
 Single-port solution for all LAN, SAN, and application communication
 Highly reliable cluster management (redundant Subnet Manager)
 Automatic configuration of cluster switches and ports, performed by the Subnet Manager software
First industry-standard high-speed interconnect!
  • 10. RDMA – How Does it Work?
(Diagram: RDMA over InfiniBand. With TCP/IP, data crosses from the application buffer through OS and NIC buffers in the kernel on both hosts. With RDMA, the HCAs move data directly between the two applications' buffers, bypassing the OS and CPUs on both sides.)
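The kernel-bypass path above is driven through the verbs API. Below is a minimal sketch of posting an RDMA write with libibverbs; the helper name and its parameters are illustrative, and the connection setup (exchanging LIDs, QP numbers, and rkeys out of band, and moving the RC QP through INIT/RTR/RTS) is assumed to have happened already. Compile with -libverbs.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Minimal sketch: one-sided RDMA WRITE on an already-connected RC QP.
     * local_buf must belong to the registered memory region mr. */
    int rdma_write_example(struct ibv_qp *qp, struct ibv_cq *cq,
                           struct ibv_mr *mr, void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t remote_rkey)
    {
        /* Scatter/gather entry describing the local registered buffer */
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };

        /* RDMA WRITE: the HCA places the data directly into the remote
         * application buffer; no remote CPU or OS involvement */
        struct ibv_send_wr wr = {0}, *bad_wr = NULL;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
        wr.wr.rdma.remote_addr = remote_addr;        /* learned out of band */
        wr.wr.rdma.rkey        = remote_rkey;

        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;

        /* Busy-poll the completion queue; the kernel is bypassed throughout */
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;
        return wc.status == IBV_WC_SUCCESS ? 0 : -1;
    }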
  • 11. The InfiniBand Architecture
 Industry standard defined by the InfiniBand Trade Association
 Defines a System Area Network architecture
• Comprehensive specification: from the physical layer to applications
 Architecture supports
• Host Channel Adapters (HCA)
• Switches
• Routers
(Diagram: processor nodes with HCAs, a storage subsystem (RAID), consoles, and Ethernet/Fibre Channel gateways, interconnected by switches in an InfiniBand subnet controlled by a Subnet Manager.)
  • 12. InfiniBand Components Overview
 Host Channel Adapter (HCA)
• Device that terminates an IB link, executes transport-level functions, and supports the verbs interface
 Switch
• A device that moves packets from one link to another within the same IB subnet
 Router
• A device that transports packets between different IB subnets
 Bridge/Gateway
• InfiniBand to Ethernet
  • 13. Host Channel Adapters (HCA)
 Equivalent to an Ethernet NIC; identified by a GUID (Global Unique ID)
 Converts PCI to InfiniBand
 CPU offload of transport operations
 End-to-end QoS and congestion control
 HCA bandwidth options (per-lane rate × 4 lanes):
• Single Data Rate (SDR): 2.5 Gb/s × 4 = 10 Gb/s
• Double Data Rate (DDR): 5 Gb/s × 4 = 20 Gb/s
• Quadruple Data Rate (QDR): 10 Gb/s × 4 = 40 Gb/s
• Fourteen Data Rate (FDR): 14 Gb/s × 4 = 56 Gb/s
• Enhanced Data Rate (EDR): 25 Gb/s × 4 = 100 Gb/s
  • 14. Global Unique Identifier (GUID) – Physical Address
 Every InfiniBand node requires GUID and LID addresses
 GUID (Global Unique Identifier): a 64-bit address, like an Ethernet MAC address
• Assigned by the IB vendor
• Persistent through reboots
 An IB switch carries multiple GUIDs
• Node GUID: identifies the HCA or switch as an entity
• Port GUID: identifies the individual port
• System GUID: allows combining multiple GUIDs into one entity
  • 15. The IB Fabric Basic Building Block
 A single 36-port IB switch chip is the basic building block for every IB switch module
 A multi-port switching module is built from multiple such chips
 In this example, a 72-port switch is built from 6 identical chips:
• 4 chips function as lines
• 2 chips function as cores
  • 16. IB Fabric L2 Switching Addressing: Local Identifier (LID)
 Local Identifier: a 16-bit L2 address
• Assigned by the Subnet Manager when the port becomes active
• Not persistent through reboots
 LID address ranges (applied in the sketch below):
• 0x0000 = Reserved
• 0x0001 – 0xBFFF = Unicast
• 0xC000 – 0xFFFE = Multicast
• 0xFFFF = Reserved for special use (permissive LID)
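For illustration, a tiny helper that classifies a 16-bit LID according to the ranges above (the constants come straight from the list; the function name is illustrative):

    #include <stdint.h>

    /* Classify a 16-bit LID per the address ranges on the slide */
    const char *lid_class(uint16_t lid)
    {
        if (lid == 0x0000)  return "reserved";
        if (lid <= 0xBFFF)  return "unicast";      /* 0x0001 - 0xBFFF */
        if (lid == 0xFFFF)  return "reserved (permissive LID)";
        return "multicast";                        /* 0xC000 - 0xFFFE */
    }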
  • 17. InfiniBand Network Segmentation – Partitions
 Define different partitions for different customers
 Define different partitions for different applications
 Allows fabric partitioning for security purposes
 Allows fabric partitioning for Quality of Service (QoS)
 Each partition has an identifier named PKEY
(Diagram: example partitions with PKEY IDs 2 and 3 at service levels 1 and 3.)
  • 18. GID – Global Identifier
 Usage
• A 128-bit field in the Global Routing Header (GRH), used to route packets between different IB subnets
• Identifies multicast groups (IB and IPoIB)
 Structure (IPv6-style address; see the sketch below)
• Subnet prefix: a 64-bit identifier that uniquely identifies the set of end-ports managed by a common Subnet Manager
• GUID: a 64-bit identifier provided by the manufacturer
 Example: port GUID 0x0002c90300455fd1 yields the default GID fe80:0000:0000:0000:0002:c903:0045:5fd1
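A small sketch of the composition rule: the default GID is simply the 64-bit subnet prefix (the link-local fe80::/64 prefix when the SM has not assigned one) concatenated with the 64-bit port GUID. The function name is illustrative:

    #include <stdint.h>

    /* Build a 128-bit GID: bytes 0-7 = subnet prefix, bytes 8-15 = port GUID,
     * both stored most-significant byte first */
    void make_gid(uint8_t gid[16], uint64_t subnet_prefix, uint64_t port_guid)
    {
        for (int i = 0; i < 8; i++) {
            gid[i]     = (uint8_t)(subnet_prefix >> (56 - 8 * i));
            gid[8 + i] = (uint8_t)(port_guid     >> (56 - 8 * i));
        }
    }

    /* make_gid(gid, 0xfe80000000000000ULL, 0x0002c90300455fd1ULL) yields
     * fe80:0000:0000:0000:0002:c903:0045:5fd1, matching the slide's example */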
  • 19. IB Basic Management Concepts
 Node: any managed entity (end node, switch, router)
 Manager: an active entity that sources commands and queries
• The Subnet Manager (SM)
 Agent: a (mostly) passive entity residing on every node; responds to Subnet Manager queries
 Management Datagram (MAD):
• Standard message format for manager-agent communication
• Carried in an unreliable datagram (UD)
  • 20. Objectives of Subnet Management
 Initialization and configuration of the subnet elements
 Establishing the best traffic paths from source to destination through the subnet
 Fault isolation
 Continuing these activities as the topology changes
 Preventing unauthorized Subnet Managers
  • 21. Node & Switch Main Identifiers
 IB port basic identifiers
• Host Channel Adapter: HCA (the IB "NIC")
• Port number
• Globally Unique ID (GUID): 64 bits, like a MAC address, e.g. 00:02:C9:02:00:41:38:30
- Each 36-port "basic" switch has its own switch and system GUIDs
- All ports belonging to the same "basic" switch share the switch GUID
• Local Identifier (LID)
 LID
• A local identifier assigned to every IB device by the SM, used for packet switching within the IB fabric
• All ports of the same ASIC use the same LID
  • 22. Node & Switch Main Identifiers (cont.): Virtual Lanes
 Each Virtual Lane uses separate buffers to send its packets toward the other side of the link
 VL15 is used exclusively for SM management traffic
 VLs 0-7 are used for data traffic
 VLs separate traffic with different bandwidth and QoS requirements over the same physical port
  • 23. (Diagram: data packets on VLs 0-7 and link-control traffic on VL15 are multiplexed onto the physical link at the transmitter and de-multiplexed back into per-VL buffers at the receiver.)
  • 24. Subnet Manager Cluster Discovery
  • 25. Subnet Manager & Fabric Configuration Process
1. Physical subnet establishment
2. Subnet discovery
3. Information gathering
4. LID assignment
5. Path establishment
6. Port configuration
7. Switch configuration
8. Subnet activation
  • 26. Subnet Manager (SM) Rules & Roles
 Every subnet must have at least one SM
- Manages all elements in the IB fabric
- Discovers the subnet topology
- Assigns LIDs to devices
- Calculates and programs the switch forwarding tables (LFT pathing)
- Monitors changes in the subnet
 Can be implemented anywhere in the fabric
- Node, switch, or specialized device
 No more than one active SM is allowed
- One active (master); the rest are standby (HA)
  • 27. Fabric Discovery (A)
1. The SM wakes up and starts the fabric discovery process
2. The SM starts a "conversation" with every node over the InfiniBand link it is connected to. In this discovery stage the SM collects:
• Switch information, followed by port information
• Host information
3. Any switch that has already been discovered is used as a gateway for further discovery of that switch's links and of the switches connected to it (its neighbors)
  • 28. Fabric Discovery (B)
4. The SM gathers information by sending and receiving SMPs (Subnet Management Packets)
a. These special management packets are sent on Virtual Lane 15 (VL15)
• VL15 is a special, non-flow-controlled VL
b. Two primary types of SMPs build the cluster routing tables:
• Directed-route (DR) SMPs: addressed by node GUIDs and port numbers, so they work before any LIDs exist; this is the type primarily used by OpenSM (see the sketch below)
• LID-routed (LR) SMPs: routed by the LIDs the SM has assigned to each node, once the topology and routing tables exist
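A conceptual sketch (not the OpenSM data structures) of what makes directed routing LID-independent: a route is just the ordered list of egress ports to take at each hop, which the SM can follow before any addresses have been assigned:

    #include <stdint.h>

    /* Illustrative only: a directed route is the ordered list of egress
     * port numbers from the SM node; IBTA DR SMPs carry such a path
     * (InitialPath) together with a hop count/pointer */
    struct dr_path {
        uint8_t hop_count;
        uint8_t port_at_hop[64];   /* port_at_hop[1..hop_count] is used */
    };

    /* Example: reach a switch two hops away via "exit port 2, then the
     * neighbor's port 3": hop_count = 2, port_at_hop = {0, 2, 3} */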
  • 29. Fabric Information Gathering During Discovery
 Node info gathered
• Node type
• Number of ports
• GUID
• Partition table size
 Port info gathered
• Forwarding database size
• MTU
• Width
• VLs
  • 30. Fabric Direct Route Information Gathering
 Building the direct routing table from and to each one of the fabric elements
 Each hop in a path is identified by its port number and GUID
 The table content is saved in the SM's LMX table
(Diagram: example direct routes across switches 1-6 and hosts H-11/H-16, each step given as a switch GUID plus egress port number.)
  • 31. LID Assignment
 After the SM finishes gathering the needed subnet information, it assigns a base LID and an LMC to each attached end port
• The LID is assigned at the port level rather than the device level
• Switch external ports do not get (or need) LIDs
 The DLID is used as the main address for InfiniBand packet switching
 Each switch port can be identified by the combination of LID and port number
(Diagram: example fabric with assigned LIDs 3, 21, 22, 23, 72, 75, 81, 82.)
  • 32. Linear Forwarding Table Establishment (Path Establishment)
 After the SM finishes gathering all fabric information, including the direct-route tables, it assigns a LID to each node
 At this stage the LMX table is populated with the candidate routes to each node
 From the LMX, the SM derives the best route to reach each DLID, alongside the alternative routes
 The best-path result is based on a Shortest Path First (SPF) algorithm (see the sketch after this slide)
 Resulting LFT for Switch 1 (DLID → best exit port):
21 → 2, 22 → 3, 23 → 8, 75 → 3, 81 → 3, 82 → 8
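The sketch below illustrates the idea with a plain breadth-first search, which is shortest-path-first when all links have equal cost: for each destination, the switch's exit port is the first hop on a shortest path. The adjacency representation and names are hypothetical, not OpenSM's internals:

    #include <stdint.h>
    #include <string.h>

    #define MAX_NODES 64
    #define MAX_PORTS 37            /* ports 1..36, as on a SwitchX chip */

    /* adj[u][p]: node reached through port p of node u, or -1 if unused.
     * On return, exit_port[d] is src's best egress port toward node d. */
    void build_lft(int src, int n_nodes, const int adj[][MAX_PORTS],
                   uint8_t exit_port[/* n_nodes */])
    {
        int dist[MAX_NODES], queue[MAX_NODES], head = 0, tail = 0;
        memset(dist, -1, sizeof(dist));       /* -1 = not visited yet */
        memset(exit_port, 0, n_nodes);
        dist[src] = 0;
        queue[tail++] = src;
        while (head < tail) {
            int u = queue[head++];
            for (int p = 1; p < MAX_PORTS; p++) {
                int v = adj[u][p];
                if (v < 0 || dist[v] >= 0)
                    continue;                  /* no link, or already reached */
                dist[v] = dist[u] + 1;
                /* The first hop is inherited along the path: packets for v
                 * leave src through the same port as packets for u, unless
                 * v is src's direct neighbor */
                exit_port[v] = (u == src) ? (uint8_t)p : exit_port[u];
                queue[tail++] = v;
            }
        }
    }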
  • 33. LID-Routed (LR) Forwarding
 Uses the LFT tables
 Based on the data gathered in the LMX (direct routing)
 This is the standard packet routing used by switches
 Uses regular link-level headers to define the destination and other information, such as:
• DLID = LID of the final destination
• SL = service level of the path
• Each switch uses its forwarding table and SL-to-VL table to decide on the packet's output port and VL
(Table: the LFT of Switch 1, as on the previous slide.)
  • 34. LID-Routed (LR) Forwarding (cont.)
 LRH, the Local Routing Header, carries:
• Source and destination LIDs
• Service Level (SL)
• Virtual Lane (VL)
• Packet length
 InfiniBand data packet layout:
LRH (8B) | GRH (40B) | BTH (12B) | Ext HDRs (var) | Payload (0-4096B) | ICRC (4B) | VCRC (2B)
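As an illustration, the 8-byte LRH can be unpacked with shifts and masks. The field widths below follow the IBTA layout (VL:4, LVer:4, SL:4, rsvd:2, LNH:2, DLID:16, rsvd:5, PktLen:11, SLID:16); the struct and function names are illustrative:

    #include <stdint.h>

    struct lrh {
        uint8_t  vl, lver, sl, lnh;
        uint16_t dlid, slid, pkt_len;   /* pkt_len is in 4-byte words */
    };

    /* Unpack the 8 LRH bytes (network byte order) into named fields */
    void lrh_parse(const uint8_t b[8], struct lrh *h)
    {
        h->vl      = b[0] >> 4;                          /* Virtual Lane  */
        h->lver    = b[0] & 0x0F;                        /* Link version  */
        h->sl      = b[1] >> 4;                          /* Service Level */
        h->lnh     = b[1] & 0x03;                        /* next header   */
        h->dlid    = (uint16_t)(b[2] << 8 | b[3]);       /* dest LID      */
        h->pkt_len = (uint16_t)((b[4] & 0x07) << 8 | b[5]);
        h->slid    = (uint16_t)(b[6] << 8 | b[7]);       /* source LID    */
    }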
  • 35. Tracking Fabric Status – SM Sweeps
 Light sweep:
• Routine sweep by the Subnet Manager
• By default runs every 30 seconds
• Polls all switches for their switch and port information
 A light sweep detects:
• Port status changes
• A new SM speaking on the subnet
• Subnet Manager priority changes
  • 36. Tracking Fabric Status – SM Sweeps (cont.)
 Any change detected by the light sweep triggers a heavy sweep
 IB trap
• A switch status change raises an immediate IB trap to the Subnet Manager, which also triggers a heavy sweep
 Heavy sweep
• Causes the SM's fabric discovery to be performed again from scratch
  • 37. InfiniBand Fabric Topologies
  • 38. InfiniBand Fabric Commonly Used Topologies
 Back-to-back
 2-tier fat tree (modular switches are based on the fat-tree architecture)
 3D torus
 Dual star
 Hybrid
  • 39. The IB Fabric Basic Building Block
 A single 36-port IB switch chip is the basic building block for every IB switch module
 A multi-port switching module is built from multiple such chips
 In this example, a 72-port switch is built from 6 identical chips:
• 4 chips function as lines
• 2 chips function as cores
  • 40. CLOS Topology
 A pyramid-shaped topology
 The switches at the top of the pyramid are called spines/cores
• The core/spine switches interconnect the other switch tiers
 The switches at the bottom of the pyramid are called leafs/lines/edges
• The leaf/line/edge switches connect to the fabric nodes/hosts
 In a non-blocking CLOS fabric there is an equal number of external and internal connections
  • 41. CLOS Topology (cont.)
 External connections:
• The connections between the hosts and the line switches
 Internal connections:
• The connections between the core and the line switches
 A non-blocking fabric always has balanced cross-bisectional bandwidth
 If the number of external connections exceeds the number of internal connections, the configuration is blocking
  • 42. CLOS-3
 The topology detailed here is called CLOS-3
 The maximum traffic path between source and destination includes 3 hops (3 switches)
 Example: a session from A to B
• One hop from A to switch L1-1
• Next hop from switch L1-1 to switch L2-1
• Last hop from L2-1 to L1-4
  • 43. CLOS-3 (cont.)
 This example shows a 108-port non-blocking fabric:
• 108 hosts connect to the line switches (6 line switches × 18 host ports = 108)
• 108 links connect the line switches to the core switches, enabling non-blocking interconnection of the line switches
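The same arithmetic generalizes to any non-blocking 2-tier fat tree built from r-port chips: each leaf splits its ports evenly between hosts and spine uplinks, so capacity is leafs × r/2, up to r²/2 when the spine tier is full (every spine port terminating one leaf uplink). A quick check of the numbers, with variable names chosen for illustration:

    #include <stdio.h>

    int main(void)
    {
        int r = 36;                   /* SwitchX chip radix                 */
        int leafs = 6;                /* the 108-port example above         */
        printf("hosts = %d\n", leafs * r / 2);   /* 6 * 18   = 108 hosts    */
        printf("max   = %d\n", r * r / 2);       /* 36*36/2  = 648: matches
                                                    the largest modular
                                                    switch in the portfolio */
        return 0;
    }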
  • 44. IB Fabric Protocol Layers
  • 45. (Diagram: the IB protocol stack across two end nodes and an L2 switch. Upper-level protocols exchange client transactions and messages; the transport layer provides queue pairs and SAR; the network layer handles inter-subnet routing; the link layer performs LID-based L2 switching, link encoding, and media access control; the physical layer carries the signal. The switch relays packets at the link layer only.)
  • 46. IB Architecture Layers
 Software transport verbs and upper-layer protocols:
- The interface between application programs and the hardware
- Allow support of legacy protocols such as TCP/IP
- Define the methodology for management functions
 Physical:
- Signal levels and frequency, media, connectors
 Transport:
- Delivers packets to the appropriate queue pair; message assembly/disassembly, access rights, etc.
 Data link (symbols and framing):
- Source-to-destination delivery within the same subnet; credit-based flow control; how packets are switched
 Network:
- How packets are routed between different subnets
  • 47. InfiniBand Header Structure
(Diagram: an upper-layer message is segmented by the transport layer (SAR) into packets; the network layer adds subnet prefix + GUID addressing; the link layer switches on LIDs; the receiving side reassembles the packets into the original message.)
  • 48. Transport Layer – Responsibilities
 The network and link protocols deliver a packet to the desired destination
 The transport layer:
• Segmentation and reassembly (SAR): splits message payloads coming from the upper layer into multiple packets that fit the valid MTU size
• Delivers each packet to the proper queue pair (QP, assigned to a specific session)
• Instructs the QP how to process the packet's data (Work Queue Element, WQE)
• Reassembles the packets arriving from the other side into messages
  • 49. Transport Layer – Responsibilities (cont.)
(Diagram: the same stack view as slide 47, highlighting SAR in the transport layer on both end nodes.)
  • 50. Layer 2 Forwarding
 Switches use the FDB (Forwarding Database)
• Based on the DLID and SL, a packet is sent to the correct output port and specific VL
(Diagram: an inbound packet's SL and DLID are looked up in the FDB (DLID-to-port) and the SL-to-VL table to select the output port and VL.)
  • 51. Link Layer – Flow Control
 Credit-based link-level flow control
• Link flow control assures no packet loss within the fabric, even in the presence of congestion
• Link receivers grant packet receive-buffer space credits per Virtual Lane
• Flow-control credits are issued in 64-byte units (see the sketch below)
 Separate flow control per Virtual Lane provides:
• Alleviation of head-of-line blocking
 Virtual fabrics: congestion and latency on one VL do not impact traffic with guaranteed QoS on another VL, even though they share the same physical link
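A toy model of the credit mechanics (the actual wire protocol, with its flow-control packets and counters, is elided): a packet is transmitted only while its VL holds enough 64-byte credit blocks, so the link applies back-pressure instead of dropping. All names here are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_DATA_VLS 8

    static uint32_t credits[NUM_DATA_VLS];   /* 64-byte blocks per data VL */

    /* Transmit side: consume credits, or refuse (back-pressure, no drop) */
    bool try_send(int vl, uint32_t pkt_bytes)
    {
        uint32_t need = (pkt_bytes + 63) / 64;   /* round up to 64B blocks */
        if (credits[vl] < need)
            return false;                        /* wait for more credits  */
        credits[vl] -= need;
        return true;
    }

    /* Receive side frees buffer space and the link partner returns credits */
    void on_credit_return(int vl, uint32_t blocks)
    {
        credits[vl] += blocks;
    }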
  • 52. Physical Layer – Responsibilities
 InfiniBand is a lossless fabric
 The maximum Bit Error Rate (BER) allowed by the IB spec is 10^-12; statistically, Mellanox fabrics provide around 10^-15
 The physical layer must guarantee effective signaling to meet this BER requirement
  • 53. Physical Layer (cont.)
 Industry-standard media types
• Copper: 7 m at QDR, 3 m at FDR
• Fiber: 100/300 m at QDR and FDR
 64/66-bit encoding on FDR links
• Encoding makes it possible to carry high-speed digital signals over longer distances while keeping bandwidth efficiency high: X data bits are sent on the line as Y signal bits
• 56.25 Gb/s raw (4 lanes × 14.0625 Gb/s) × 64/66 ≈ 54.5 Gb/s effective
 8/10-bit encoding (DDR and QDR)
• 8/10 = 80% line efficiency, e.g. 80% × 40 Gb/s (QDR) = 32 Gb/s
 Connectors: 4X QSFP fiber, 4X QSFP copper
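The effective rates follow directly from raw signaling rate × encoding efficiency; a two-line check (assuming the usual 14.0625 Gb/s FDR lane rate):

    #include <stdio.h>

    /* Effective bandwidth = raw signaling rate * encoding efficiency */
    int main(void)
    {
        double qdr = 40.0  * 8.0  / 10.0;   /* 8b/10b:  32.0 Gb/s         */
        double fdr = 56.25 * 64.0 / 66.0;   /* 64b/66b: ~54.5 Gb/s        */
                                            /* (4 lanes * 14.0625 Gb/s)   */
        printf("QDR effective: %.1f Gb/s\n", qdr);
        printf("FDR effective: %.1f Gb/s\n", fdr);
        return 0;
    }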
  • 54. Mellanox Cables – Perceptions vs. Facts
 Perception: Mellanox cables are rebranded from a cable vendor. Fact: Mellanox cables are manufactured by Mellanox
 Perception: the vendor can sell the same cables. Fact: no other vendor is allowed to sell Mellanox cables
 Mellanox cables use a different assembly procedure
 Mellanox cables are tested with a unique test suite; vendors' "finished goods" fail Mellanox's dedicated testing
 Mellanox allows customers to use any IBTA-approved IB cables
(Product types: passive copper cables, SFP+ active optical cables, active copper cables.)
  • 55. Mellanox Passive Copper Cables
 Superior design and qualification process
 Committed to a Bit Error Rate (BER) better than 10^-15
 Longest reach with a Mellanox end-to-end solution
Data rate / max reach (passive copper):
• FDR: 3 meters
• FDR10: 5 meters
• QDR: 7 meters
• 40GbE: 7 meters
• 10GbE: 7 meters
  • 56. Mellanox Active Fiber Cables
 Superior design and qualification process
 Committed to a Bit Error Rate (BER) better than 10^-15
 Longest reach with a Mellanox end-to-end solution
 Optical performance optimization (patent pending)
Data rate / max reach:
• FDR: 300 meters
• FDR10: 100 meters
• QDR: 300 meters
• 40GbE: 100 meters
  • 57. Mellanox Family Products
  • 58. Leading Supplier of End-to-End Interconnect Solutions
(Recap of slide 3: the comprehensive end-to-end InfiniBand and Ethernet portfolio of ICs, adapter cards, switches/gateways, host/fabric software, and cables, with Virtual Protocol Interconnect spanning 56G InfiniBand, FCoIB, 10/40/56GbE, FCoE, and Fibre Channel across server/compute, switch/gateway, and storage front/back-end.)
  • 59. InfiniBand Switches & Gateways
  • 60. FDR InfiniBand Switch Portfolio
 Modular switches: 648, 324, 216, and 108 ports
 Edge switches:
• SX6036: 36 ports, managed
• SX6025: 36 ports, externally managed
• SX6018: 18 ports, managed
• SX6015: 18 ports, externally managed
• SX6012: 12 ports, managed
• SX6005: 12 ports, externally managed
 Long-distance bridge: VPI bridging and routing
  • 61. SwitchX® VPI Technology Highlights
One switch, multiple technologies with Virtual Protocol Interconnect® (VPI):
• VPI per port: the same box runs InfiniBand AND Ethernet, port by port
• VPI on box: the same box runs InfiniBand OR Ethernet
• VPI bridging: the same box bridges InfiniBand AND Ethernet (bridging and routing)
  • 62. MetroX™ – Mellanox Long-Haul Solutions
 Provide InfiniBand and Ethernet long-haul connectivity of up to 80 km for campus and metro applications
 Connect data centers deployed across multiple geographically distributed sites
 Extend InfiniBand RDMA and Ethernet RoCE beyond local data centers and storage clusters
 A cost-effective, low-power, easily managed, and scalable solution
 Managed as a single unified network infrastructure
  • 63. MetroDX and MetroX Features
 TX6000: 1 km distance; 640 Gb/s throughput; 16 × FDR10 long-haul + 16 × FDR downlink ports; latency 200 ns + 5 us/km over fiber; ~200 W; one data VL + VL15; 1RU
 TX6100: 10 km distance; 240 Gb/s throughput; 6 × 40 Gb/s long-haul + 6 × 56 Gb/s downlink ports; latency 200 ns + 5 us/km over fiber; ~200 W; one data VL + VL15; 1RU
 TX6240: 40 km distance; 80 Gb/s throughput; 2 × 10/40 Gb/s long-haul + 2 × 56 Gb/s downlink ports; latency 700 ns + 5 us/km over fiber; ~280 W; one data VL + VL15; 2RU
 TX6280: 80 km distance; 40 Gb/s throughput; 1 × 10/40 Gb/s long-haul + 1 × 56 Gb/s downlink port; latency 700 ns + 5 us/km over fiber; ~280 W; one data VL + VL15; 2RU
  • 64. Mellanox Host Channel Adapters (HCA)
Reference: ConnectX®-3 VPI Single and Dual QSFP Port Adapter Card User Manual
http://www.mellanox.com/page/products_dyn?product_family=119&mtag=connectx_3_vpi
  • 65. HCA ConnectX-3 InfiniBand Main Features
 Up to 56 Gb/s InfiniBand or 40 Gigabit Ethernet per port
 PCI Express 3.0 (up to 8 GT/s)
 CPU offload of transport operations
 Application offload
 GPU communication acceleration
 End-to-end QoS and congestion control
 Hardware-based I/O virtualization
 Dynamic power management
 Fibre Channel encapsulation (FCoIB or FCoE)
 Ethernet encapsulation (EoIB)
Target workloads: HPC, database, cloud computing, virtualization
  • 66. Adapters Offering
 ConnectX-3 Pro: NVGRE and VXLAN hardware offload, RoCE v2 (UDP), ECN, QCN
 ConnectX-3 VPI: up to 56 Gb/s IB, up to 56 GbE, RDMA, CPU offload, SR-IOV
 Connect-IB: up to 56 Gb/s IB, greater than 100 Gb/s bi-directional, DC transport, T10-DIF, PCIe x16, more than 130M messages/sec
  • 67. Fabric Management
  • 68. Goals of Fabric Utilities in an HPC Context
 Enable fast cluster bring-up
• Point out issues with devices, systems, and cables
• Provide an inventory including cables, devices, FW, and SW
• Perform device-specific (proprietary) checks
• Eye-opening and BER checks
• Catch cabling mistakes
 Validate the Subnet Manager's work
• Verify connectivity at the lowest level possible
• Report the subnet configuration
• SM-agnostic
  • 69. Goals of Fabric Utilities in an HPC Context (cont.)
 Diagnose L2 communication failures
• At the entire subnet level
• On a point-to-point path
 Monitor network health
• Continuously and with low overhead
  • 70. Fabric Management Solution Overview
 ibutils: ibdiagnet/ibdiagpath
• An automated L2 health-analysis procedure
• Text interface
• No dedicated "monitoring" mode
• Significant development over the past year on features and runtime performance at scale
 UFM
• High-end monitoring and provisioning capabilities
• GUI-based, with CLI options
• Includes the ibutils capabilities with additional features
• Central device management
- Fabric dashboard
- Congestion analysis
• System integration capabilities
- SNMP traps and alarms
  • 71. UFM in the Fabric
 Software or appliance form factor
 Two or more instances for high availability (synchronization, heartbeat)
 Switch and HCA management
 Full-management or monitoring-only modes
  • 72. Multi-Site Management
(Diagram: a single UFM view managing Site 1, Site 2, ... Site n.)
  • 73. Integration with 3rd-Party Systems
 Extensible architecture
• Based on web services
 Open API for users or 3rd-party extensions
• Allows simple reporting, provisioning, and monitoring
• Task automation
• Software Development Kit
 Extensible object model
• User-defined fields
• User-defined menus
(Diagram: a web-based API offering read/write/manage access to monitoring systems, configuration management, orchestrators, and job schedulers; alerts delivered via SNMP traps.)
  • 74. UFM – Comprehensive, Robust Management
Automatic discovery, central device management, fabric dashboard, congestion analysis, health & performance monitoring, service-oriented provisioning, fabric health reports
  • 75. UFM Main Features
Automatic discovery, central device management, fabric dashboard, congestion analysis, health & performance monitoring, advanced alerting, fabric health reports, service-oriented provisioning
  • 76. Advanced Monitoring and Analysis
 Monitor and analyze fabric performance
• Bandwidth utilization
• Unique congestion monitoring
• Dashboard for an aggregated fabric view
 Real-time fabric-wide health monitoring
• Monitor events and errors throughout the fabric
• Threshold-based alarms
• Granular monitoring of host and switch parameters
 Innovative congestion mapping
• One view for fabric-wide congestion and traffic patterns
• Enables root-cause analysis of routing, job-placement, or resource-allocation inefficiencies
 All managed at the job/aggregation level