7. Host software stack
• Leverages OpenFabrics Alliance
(OFA)
• Open source elements
• Host software stack via OFA
• Intel Omni-Path FastFabric Tools,
Fabric Manager, and GUI
• OPA support is included in
standard Linux distros
• Starting with RHEL 7.3 and SLES
12 SP2
9. Performance Scaled Messaging 2 (PSM2)
• User-level library that provides the API for the Intel Omni-Path HFI
• PSM is specifically designed for MPI to provide high MPI message rates
• Provides matched queue (MQ), building block for tag matching send and
receive calls
• Implemented to scale up to millions of MPI ranks (WHAAAT?)
• Provides active message (AM) API to implement PGAS programming
model
• OpenSHMEM, GASNet etc.
• Connectionless with minimal on-adapter state
• Backward compatible with PSM (Intel TrueScale)
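The matched queue idea above can be sketched as a toy model. This is not the PSM2 API: the class name, the method names, and the tag/mask convention (mask bits set to 1 must match, 0 bits are wildcards) are all assumptions made for illustration.

```python
from collections import deque


class MatchedQueue:
    """Toy tag-matching queue (hypothetical, not the PSM2 MQ interface)."""

    def __init__(self):
        self.posted = deque()      # (tag, mask) receives waiting for a message
        self.unexpected = deque()  # (tag, payload) messages that arrived early

    @staticmethod
    def _matches(recv_tag, mask, msg_tag):
        # Only the bits selected by the mask take part in the comparison.
        return (recv_tag & mask) == (msg_tag & mask)

    def post_recv(self, tag, mask):
        """Post a receive; return a payload if an early message already matches."""
        for i, (msg_tag, payload) in enumerate(self.unexpected):
            if self._matches(tag, mask, msg_tag):
                del self.unexpected[i]
                return payload
        self.posted.append((tag, mask))
        return None

    def on_arrival(self, msg_tag, payload):
        """Deliver an incoming message to the oldest matching posted receive."""
        for i, (tag, mask) in enumerate(self.posted):
            if self._matches(tag, mask, msg_tag):
                del self.posted[i]
                return (tag, mask, payload)
        self.unexpected.append((msg_tag, payload))
        return None
```

Both queues are searched in order, so the oldest matching entry wins. That ordering is what MPI-style send/receive semantics would need, but it is a design choice of this sketch, not something the slides state.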
11. PSM (Programmable I/O)
Host Driven Send
• Optimize latency and message rate for high
priority message
• PIO is used when the message size is less than or equal to 8 KB
Eager Receive
• Data is first stored in receive buffers
• Data is then copied from the receive buffer to the application
buffer
12. PSM (SDMA)
Send DMA
• Optimizes bandwidth for large messages
• 16 SDMA engines for CPU offload
• Generally used when the message size is greater than
16 KB
Direct data placement
• Data directly placed into application
buffer
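The two size thresholds from the PIO and SDMA slides can be put side by side in a small sketch. The function and constant names are hypothetical, and the slides say nothing about sizes between 8 KB and 16 KB, so that range is reported as implementation-defined instead of guessed.

```python
PIO_MAX_BYTES = 8 * 1024     # slide 11: PIO when size <= 8 KB
SDMA_MIN_BYTES = 16 * 1024   # slide 12: SDMA generally when size > 16 KB


def choose_send_path(size_bytes: int) -> str:
    """Return the send mechanism the slides suggest for a message size."""
    if size_bytes < 0:
        raise ValueError("message size cannot be negative")
    if size_bytes <= PIO_MAX_BYTES:
        return "pio"    # host-driven send: CPU writes the data, low latency
    if size_bytes > SDMA_MIN_BYTES:
        return "sdma"   # send DMA engine: CPU offload, high bandwidth
    # The slides leave this range open; do not assume either path.
    return "implementation-defined"
```

For example, `choose_send_path(4096)` returns `"pio"` and `choose_send_path(1 << 20)` returns `"sdma"`.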
13. Network Stack:
With respect to the OSI stack:
• Layer 1.5: Link Transfer Protocol
• Responsible for reliable delivery of Layer 2 packets
• Flow control and link control
• Layer 2: Data Link Layer
• Fabric addressing, switching, QoS
• Partitioning support
• Layers 4-7: Transport to Application Layers
• Interface between software libraries and Intel OPA HFI
19. Data Link Layer
• Quality of Service support
• Bandwidth allocation and
traffic separation
• Protocol deadlock avoidance
(e.g., request/response)
• Congestion management
• Adaptive routing
• Dispersive routing
• Partitions
• Isolation mechanism, where every packet is associated with a single
partition
20. Transport layer and key software
• On packet or data loss, packets are retransmitted
• Software layers expose the network API to the user
• PSM
• OFED Verbs
• OFI
Mellanox and QLogic were in InfiniBand
QLogic acquisition
Cray interconnect (intellectual property)
ASIC (HFI or switch)
Edge switches have fewer ports and fewer protocols supported
Director switches have more ports and more protocols
Today: card modules
In future: multi-chip packages (MCP)
Therefore OPA will no longer consume a PCIe slot
Heart of compute software
Boot over fabric, API for management of storage, drivers, middleware
Element mgmt. stack: on switches (GUI and CLI)
FF Tools: Fabric bring up and debug
FM: Same as the Subnet Manager in InfiniBand
GUI: Fabric monitoring
Exporting fabric communication services to applications
We had libibverbs (for RDMA)
Applications and providers
Providers: sockets are used for fallback
RDMA basics (Queue pairs, work queue elements etc.)
Draw a diagram for MQ
User lib for Omni-Path HFI
Specifically designed for MPI
Rank is a logical way of numbering processes.
Connection state is stored in DRAM
IB:
Connection state
Buffering and state machines
L4 is offloaded
Small portion is placed in cache
Small messages let CPU do the work
Explanation: offloading these to the HFI would bring throughput down
Does not make sense to offload
Same with receive: yes, there is a copy, but again it is not worth offloading
Headers are not big in comparison with WQE
Direct message placement
16 SDMA engines to offload
Header generator: Give template and tell which fields to change
RDMA basics (Queue pairs, work queue elements etc.)
1 type bit: body flits (1), head (10), tail (100), others (000)
Control flits (retransmission) and communication flits
2 types of LTP
Null LTP (not stored in replay buffer) and LTP with flits
InfiniBand uses FEC, which is checked at the endpoint
PIP does that check on a per-hop basis (cannot be disabled)
In InfiniBand, the whole lane is taken down
OPA does not take the link down; it uses PIP
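The per-hop retransmission idea in these notes can be sketched as a toy sender with a buffer of unacknowledged LTPs. Everything beyond "LTPs with flits are buffered, null LTPs are not" is an assumption: the explicit sequence numbers, the cumulative ack, the capacity limit, and all names are invented for illustration and are not the OPA wire protocol.

```python
from collections import OrderedDict


class LinkSender:
    """Toy per-hop sender keeping sent LTPs until acknowledged (hypothetical)."""

    def __init__(self, capacity=4):
        self.capacity = capacity      # assumed buffer depth
        self.buffer = OrderedDict()   # seq -> flits, kept for retransmission
        self.next_seq = 0
        self.wire = []                # record of everything put on the link

    def send_ltp(self, flits):
        """Send an LTP carrying flits and keep a copy for retransmission."""
        if len(self.buffer) >= self.capacity:
            raise BufferError("buffer full, wait for acknowledgements")
        seq = self.next_seq
        self.next_seq += 1
        self.buffer[seq] = list(flits)
        self.wire.append((seq, list(flits)))
        return seq

    def send_null_ltp(self):
        """Null LTPs carry no flits and are not stored."""
        self.wire.append((None, []))

    def ack(self, seq):
        """Release every buffered LTP up to and including seq."""
        for s in [s for s in self.buffer if s <= seq]:
            del self.buffer[s]

    def retransmit_from(self, seq):
        """Resend all buffered LTPs starting at seq, on this hop only."""
        resent = [(s, list(f)) for s, f in self.buffer.items() if s >= seq]
        self.wire.extend(resent)
        return resent
```

Because the buffer lives on each link, a corrupted LTP is repaired by the neighbouring hop rather than end to end, which matches the contrast drawn in the notes with an endpoint check.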
Adaptive: identifies congestion and adjusts itself
Dispersive: Disperses the packets over multiple lanes
Uses multipath between endpoints
Partitions:
Communication is allowed within the partition
Either full or limited member
Managed by fabric manager
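A minimal sketch of the partition rule, assuming the InfiniBand-style convention that two limited members of the same partition cannot talk to each other (the notes only say membership is full or limited). The names `Membership` and `can_communicate` are hypothetical, and in a real fabric this check is enforced by the fabric, not by user code.

```python
from enum import Enum


class Membership(Enum):
    FULL = "full"
    LIMITED = "limited"


def can_communicate(node_a: dict, node_b: dict) -> bool:
    """node_x maps partition id -> Membership for that node.

    Two nodes may communicate if they share a partition in which at least
    one of them is a full member (assumed rule, see lead-in).
    """
    for partition, member_a in node_a.items():
        member_b = node_b.get(partition)
        if member_b is None:
            continue  # not in the same partition
        if member_a is Membership.FULL or member_b is Membership.FULL:
            return True
    return False
```

A typical use would be a storage server as a full member and compute nodes as limited members: each compute node reaches the server, but compute nodes stay isolated from one another.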