libfabric: Scalable, Implementation Agnostic, Open Source, User-Centric

Optimized SW path to HW
• Minimize cache and memory footprint
• Reduce instruction count
• Minimize memory accesses

Software interfaces aligned with user requirements
• Careful requirement analysis

Inclusive development effort
• App and HW developers

Good impedance match with multiple fabric hardware
• InfiniBand, iWARP, RoCE, raw Ethernet, UDP offload, Omni-Path, GNI, BGQ, …

User-centric interfaces will help foster fabric innovation and accelerate their adoption
MPI CASE STUDY

MPICH MPI implementation

CH3 – internal portable network API
• Lack of clear network abstractions led to poor performance-related choices

CH4 – revised internal layering
• Driven by work done for OFI
• 6x+ reduction in software overhead! (instruction counts: MPI_Put 1342 -> 44, MPI_Isend 253 -> 59)

Moral: in the absence of a good abstraction, each application will invent its own, which may result in lower performance

Image source: K. Raffenetti et al., "Why Is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1," SC '17
OFI USER REQUIREMENTS

MPI developers ask for both: "Give us a high-level interface!" and "Give us a low-level interface!" (a spectrum from EASY to GURU)

OFI strives to meet both requirements

Middleware is the primary user

Looking to expand beyond HPC
OFI SOFTWARE DEVELOPMENT STRATEGIES

One size does not fit all: the user / OFI / provider stack over fabric services can be optimized at different layers, again spanning EASY to GURU

• Provider optimizes for OFI features
• Common optimization in OFI for all apps/providers, where the provider supports low-level features only
• Client uses OFI features and optimizes based on supported features

Runs on Linux, FreeBSD, OS X, Windows

TCP and UDP development support
OFI LIBFABRIC COMMUNITY

libfabric-enabled middleware: Intel® MPI Library, MPICH, Open MPI, SHMEM, Sandia SHMEM, GASNet, Clang UPC, Charm++, Chapel, PMDK, Spark, ZeroMQ, TensorFlow, MxNET, NetIO, Intel MLSL, rsockets, …

Advanced application-oriented semantics: tag matching, scalable memory registration, triggered operations, multi-receive buffers, reliable datagram endpoints, remote completion semantics, streaming endpoints, shared addressing, unexpected message buffering

Utility providers: RxM, RxD, Multi-Rail, Hooks, …

Providers: Sockets, TCP, UDP, Verbs, Cisco usNIC, Intel OPA (PSM), Cray GNI, AWS EFA, IBM Blue Gene, Shared Memory, Network Direct, HPE Gen-Z, Persistent Memory, SmartNICs and FPGAs acceleration

OFI insulates applications from fabric diversity; however, because of an OFI-provider gap, not all apps work with all providers

* Other names and brands may be claimed as the property of others
ADDRESS VECTORS

Converts portable addressing (e.g. hostname or sockaddr) to fabric-specific addresses. Possible to share an AV between processes.

Example mappings, from user addresses (IP:Port) 10.0.0.1:7000, 10.0.0.1:7001, 10.0.0.2:7000, 10.0.0.3:7003, …:

Address Vector (Table)       Address Vector (Map)
fi_addr   Fabric Address     fi_addr
0         100:3:50           100003050
1         100:3:51           100003051
2         101:3:83           101003083
3         102:3:64           102003064
…         …                  …

Table: addresses are referenced by an index
• No application storage required
• O(n) memory in provider
• Lookup required on transfers

Map: OFI returns a 64-bit value for each address
• n x 8 bytes of application memory
• No provider storage required
• Direct addressing possible
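
A minimal C sketch of creating and populating an AV, assuming 'domain' was opened earlier via fi_fabric()/fi_domain() and 'addr_buf' is a packed array of four portable addresses in the format given by info->addr_format (both names are illustrative):

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

struct fi_av_attr av_attr = {
    .type  = FI_AV_TABLE,  /* index-based; FI_AV_MAP returns 64-bit values */
    .count = 4,            /* hint: expected number of peers */
};
struct fid_av *av;
fi_addr_t peers[4];

fi_av_open(domain, &av_attr, &av, NULL);

/* OFI converts the portable addresses in addr_buf to fabric-specific
 * addresses and returns usable fi_addr_t handles in 'peers' */
fi_av_insert(av, addr_buf, 4, peers, 0, NULL);

The returned peers[i] handles are then passed as destination addresses to the data transfer calls shown on the following slides.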
DATA TRANSFER TYPES

MSG
• Maintains message boundaries, FIFO ordering

TAGGED
• Messages carry a user 'tag' or id
• Receiver selects which tag goes with each buffer
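
A short C sketch of the tagged interface, assuming an RDM endpoint 'ep' and a peer address 'peer' obtained from an address vector (both names are illustrative):

#include <rdma/fi_tagged.h>

#define TAG_DATA 0x1001   /* application-chosen tag value */

char sbuf[64] = "hello", rbuf[64];

/* Sender: attach the tag to the outgoing message */
fi_tsend(ep, sbuf, sizeof(sbuf), NULL, peer, TAG_DATA, NULL);

/* Receiver: this buffer matches only messages whose tag equals
 * TAG_DATA (ignore mask 0 means all tag bits must match) */
fi_trecv(ep, rbuf, sizeof(rbuf), NULL, FI_ADDR_UNSPEC, TAG_DATA, 0, NULL);

Completions for both operations are reported through the endpoint's completion queue (fi_cq_read), not shown here.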
DATA TRANSFER TYPES

MSG (stream endpoints)
• Data sent and received as a 'stream' (no message boundaries)
• Uses 'MSG' APIs but different endpoint capabilities
• Synchronous completion semantics (application always owns buffer)

Multicast MSG (dgram)
• Send to or receive from a multicast group
DATA TRANSFER TYPES

RMA
• RDMA semantics
• Direct reads or writes of remote memory from the user's perspective

ATOMIC
• Specify the operation f(x,y) to perform on a selected datatype
• Format of data at the target is known to fabric services
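
A C sketch of an RMA write, assuming 'ep', 'domain', and 'peer' from the earlier examples; 'remote_addr' and 'key' are illustrative names for values the target application exposed out of band:

#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>

char local_buf[4096];
struct fid_mr *mr;

/* Register local memory so it may be used as an RMA write source */
fi_mr_reg(domain, local_buf, sizeof(local_buf), FI_WRITE,
          0, 0, 0, &mr, NULL);

/* Write local_buf directly into the peer's registered region;
 * remote_addr and key describe the target buffer and must have
 * been exchanged by the application beforehand */
fi_write(ep, local_buf, sizeof(local_buf), fi_mr_desc(mr),
         peer, remote_addr, key, NULL);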
RXM PROVIDER

High-priority, broadly used: the primary path for HPC apps accessing verbs hardware

MPI / SHMEM -> OFI -> RxM RDM endpoint -> OFI -> MSG endpoints (Verbs RC, NetworkDirect, TCP)

• Optimizes for hardware features: CQ data, inline data, SRQ
• Dynamic connections, connection multiplexing
• Eager messages – small transfers
• Segmentation and reassembly – medium transfers
• Rendezvous – large transfers
• Memory registration cache

Ideal: tighter provider coupling
RXD PROVIDER

Functional; developing / optimizing. A fast development path for hardware support and a path for HPC scalability.

MPI / SHMEM -> OFI -> RxD RDM endpoint -> OFI -> DGRAM endpoints (Verbs UD, usNIC, Raw Ethernet, UDP, other..?)

• Reliability, segmentation, reassembly, flow control
• Extends the features of a simple RDM provider
• Re-designing for performance and scalability
• Analyzing provider-specific optimizations
• Offload large transfers
SHARED MEMORY PROVIDER

Available. Built on shared memory primitives.

Shared Memory Region (SMR) layout: Version, Flags, PID, Region Size, Lock, Command Queue, Response Queue, Peer Address Map, Inject Buffers

• One-sided and two-sided transfers
• Single command queue
• CMA (cross-memory attach) for large transfers
• xpmem support under consideration
PERFORMANCE MONITORING

Inline performance tracking (Linux RDPMC)

Performance data set read from a performance management unit: CPU, cache, NIC, …

Each event data entry records a count and a sum, grouped into performance 'domains' (e.g. cycles, instructions, hits, misses)

Ex: sample CPU instructions for various code paths
MULTI-RAIL PROVIDER

Under development

User -> OFI -> mRail RDM EP -> OFI -> EP 1 (active), EP 2 (standby)

• Multiple EPs, ports, NICs, fabrics; one fi_info structure per rail
• Application or admin configured
• Increase bandwidth and message rate
• Failover (TBD: recovery fallback)
• Rail selection 'plug-in' isolates the rail selection algorithm
• Requires variable message support
ACCELERATORS AND HETEROGENEOUS MEMORY

Analyzing

Components: CPU, memory, PMEM, (smart) NIC, peer device (GPU), FPGA, device memory

• APIs assume memory-mapped regions, but device memory regions may not be mapped
• May not want to write data through CPU caches
• Results may be cached by the NIC for long transactions; may need to sync results with the CPU
• CPU load/stores within the same coherency domain
• Programmable offload capabilities and flow processing
• Accelerations may be available over the fabric
DOWNLOAD LIBFABRIC AND FABTESTS

libfabric.org – redirects to GitHub
• Link to all releases
• Link to validation tests (fabtests)
• Man pages (scroll down)
INSTALLATION

Follows common conventions. GitHub only has tar files; RPMs come from the distro or OFED.

$ tar xvf libfabric-1.7.0.tar.bz2
$ cd libfabric-1.7.0
$ ./configure
$ make
$ sudo make install

• By default, configure auto-detects which providers are usable
• Providers can be enabled/disabled, e.g. --enable-verbs=yes
• --enable-debug – development build
• --help – list all options
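
For example, a development build with the verbs provider explicitly enabled might look like this (the --prefix path is illustrative):

$ ./configure --prefix=/opt/libfabric --enable-debug --enable-verbs=yes
$ make -j && sudo make install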
FI_INFO

libfabric utility program

$ fi_info                  Condensed view of usable providers
$ fi_info -l (--list)      List all installed providers
$ fi_info -v (--verbose)   Detailed attributes of usable providers
$ fi_info --env            List all environment variables, with description, effect, and default values

Key environment variables:
FI_LOG_LEVEL=[debug | info | warn | …]
FI_PROVIDER – enable/disable providers, e.g. FI_PROVIDER=[tcp,udp,…] or FI_PROVIDER=^[udp,…] (^ excludes)
FI_HOOK=[perf]
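
An illustrative invocation that restricts discovery to the tcp provider while enabling informational logging:

$ FI_PROVIDER=tcp FI_LOG_LEVEL=info fi_info -v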
CAPABILITY AND MODE BITS

Capabilities
• Desired services requested by the app
• Primary – app must request them to use them (e.g. FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC)
• Secondary – provider can indicate availability (e.g. FI_SOURCE, FI_MULTI_RECV)

Modes
• Requirements placed on the app; improve performance when implemented by the application
• App indicates which modes it supports; provider clears modes it does not need
• Sample: FI_CONTEXT, FI_LOCAL_MR
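
A C sketch of this negotiation through fi_getinfo hints (the version number and capability choices are illustrative):

#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints, *info;

    hints = fi_allocinfo();
    hints->ep_attr->type = FI_EP_RDM;        /* reliable datagram endpoint */
    hints->caps = FI_TAGGED | FI_RMA;        /* primary caps the app needs */
    hints->mode = FI_CONTEXT | FI_LOCAL_MR;  /* modes the app can support */

    if (fi_getinfo(FI_VERSION(1, 7), NULL, NULL, 0, hints, &info)) {
        fprintf(stderr, "no matching provider\n");
        return 1;
    }

    /* info->mode now holds only the modes the provider still requires;
     * the app must honor whichever bits remain set */
    printf("provider: %s, remaining modes: 0x%llx\n",
           info->fabric_attr->prov_name,
           (unsigned long long) info->mode);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}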
ATTRIBUTES

struct fi_fabric_attr {
    struct fid_fabric *fabric;      /* already opened resource (if available) */
    char              *name;
    char              *prov_name;   /* provider details; can also use env var to filter */
    uint32_t          prov_version;
    uint32_t          api_version;
};

struct fi_domain_attr {
    struct fid_domain     *domain;           /* already opened resource (if available) */
    char                  *name;
    enum fi_threading     threading;         /* how resources are allocated among
                                                threads for lockless access */
    enum fi_progress      control_progress;  /* do app threads drive transfers */
    enum fi_progress      data_progress;
    enum fi_resource_mgmt resource_mgmt;     /* provider protects against queue overruns */
    enum fi_av_type       av_type;
    int                   mr_mode;
    /* provider limits – fields omitted */
    ...
    uint64_t              caps;
    uint64_t              mode;
    uint8_t               *auth_key;         /* secure communication (job key) */
    ...
};
ATTRIBUTES

struct fi_ep_attr {
    enum fi_ep_type type;
    uint32_t        protocol;            /* indicates interoperability */
    uint32_t        protocol_version;
    size_t          max_msg_size;
    size_t          msg_prefix_size;
    size_t          max_order_raw_size;  /* order of data placement
                                            between two messages */
    size_t          max_order_war_size;
    size_t          max_order_waw_size;
    uint64_t        mem_tag_format;
    size_t          tx_ctx_cnt;          /* default, shared, or scalable */
    size_t          rx_ctx_cnt;
    size_t          auth_key_size;
    uint8_t         *auth_key;
};
ATTRIBUTES

struct fi_tx_attr {
    uint64_t caps;
    uint64_t mode;
    uint64_t op_flags;
    uint64_t msg_order;    /* can messages be sent and received out of order */
    uint64_t comp_order;   /* are completions reported in order */
    size_t   inject_size;  /* "fast" message size */
    size_t   size;
    size_t   iov_limit;
    size_t   rma_iov_limit;
};

struct fi_rx_attr {
    uint64_t caps;
    uint64_t mode;
    uint64_t op_flags;
    uint64_t msg_order;
    uint64_t comp_order;
    size_t   total_buffered_recv;
    size_t   size;
    size_t   iov_limit;
};
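
A small C sketch that walks fi_getinfo results and prints a few of these attributes (the loop and chosen fields are illustrative):

#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *info, *cur;

    if (fi_getinfo(FI_VERSION(1, 7), NULL, NULL, 0, NULL, &info))
        return 1;

    for (cur = info; cur; cur = cur->next) {
        printf("provider %s: fabric %s, domain %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name,
               cur->domain_attr->name);
        printf("  inject_size %zu, max_msg_size %zu\n",
               cur->tx_attr->inject_size, cur->ep_attr->max_msg_size);
    }
    fi_freeinfo(info);
    return 0;
}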
THANK YOU

Sean Hefty, OFIWG Co-Chair

• All things libfabric: www.libfabric.org
• Meetings:
  • https://www.openfabrics.org/my-calendar/
  • https://www.openfabrics.org/ofiwg-webex/
• Mailing list: https://lists.openfabrics.org/mailman/listinfo/ofiwg
MEMORY MONITOR AND REGISTRATION CACHE

Memory monitor core with a monitor 'plug-in' (driver notification, hooked alloc/free, or provider specific)
• Providers subscribe; events are delivered through per-provider notification queues
• A generic solution is desired here

Registration cache
• Internal API: get/put MRs, with callbacks to add/delete MRs
• MR map merges overlapping regions
• LRU list tracks active usage
• Custom limits, usage stats
PERSISTENT MEMORY

High-availability model (v1.6): the user registers a PMEM region; a remote user issues an RMA write to persistent memory and receives a commit-complete

New completion semantic; documentation currently limits the use case

Keep implementation agnostic
• Handle offload and on-load models
• Support multi-rail
• Minimize state footprint

Evolve APIs to support other usage models. Exploration:
• Byte addressable or object aware
• Single or multi-transfer commit
• Advanced operations (e.g. atomics)

Work with SNIA (Storage Networking Industry Association)