OPENFABRICS INTERFACES LIBFABRIC
Sean Hefty, OFIWG Co-Chair
January, 2019
Intel Corporation
libfabric.org
AGENDA – MOVING FORWARD WITH LIBFABRIC
Software Strategy
• Charter
Architecture
• Object Model
Provider Support
• Utility Providers
Getting Started
• Samples
API Bootstrap
• Calling fi_getinfo()
Developer Guide
• Design motivations, architecture, usage guidance
DESIGN GUIDELINES
Open Fabrics Interface
Application Requirements
Deployment Requirements
Fabric Vendor Requirements
Charter: develop interfaces aligned with user needs
• Protect software investment
• Enable new features without application changes
• Find the ‘right’ abstraction
• Ensure high performance
• Analyze impact of changes under a variety of assumptions
• Build in extensibility
OFI = a portable low-level network interface
Scalable: Optimized SW path to HW
• Minimize cache and memory footprint
• Reduce instruction count
• Minimize memory accesses
Implementation Agnostic: Good impedance match with multiple fabric hardware
• InfiniBand, iWARP, RoCE, raw Ethernet, UDP offload, Omni-Path, GNI, BGQ, …
Open Source: Inclusive development effort
• App and HW developers
User-Centric: Software interfaces aligned with user requirements
• Careful requirement analysis
libfabric
User-centric interfaces will help foster fabric innovation and accelerate its adoption
MPI CASE STUDY
• MPICH MPI implementation
• CH3 – internal portable network API
  - Lack of clear network abstractions led to poor performance-related choices
• CH4 – revised internal layering
  - Driven by work done for OFI
  - 6x+ reduction in software overhead!
Moral: in the absence of a good abstraction, each application will invent its own, which may result in lower performance
Image source: K. Raffenetti et al. Why is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1. SC ’17
MPI_Put: 1342 -> 44
MPI_Isend: 253 -> 59
OFI USER REQUIREMENTS
Give us a high-level interface!
Give us a low-level interface!
MPI developers
OFI strives to meet both requirements
Middleware is primary user
Looking to expand beyond HPC
GURU / EASY
OFI SOFTWARE DEVELOPMENT STRATEGIES
One Size Does Not Fit All
Fabric Services
[Diagram: three User / OFI / Provider stacks]
• Provider optimizes for OFI features
• Common optimization for all apps/providers
• Client uses OFI features
• Client optimizes based on supported features
• Provider supports low-level features only
Linux, FreeBSD, OS X, Windows
TCP and UDP development support
GURU / EASY
OFI LIBFABRIC COMMUNITY
libfabric Enabled Middleware
• Intel® MPI Library, MPICH, Open MPI SHMEM, Sandia SHMEM, GASNet, Clang UPC, Charm++, Chapel
• PMDK, Spark, ZeroMQ, TensorFlow, MXNet, NetIO, Intel MLSL, rsockets …
libfabric
Advanced application oriented semantics
• Tag Matching
• Scalable memory registration
• Triggered Operations
• Multi-Receive buffers
• Reliable Datagram Endpoints
• Remote Completion Semantics
• Streaming Endpoints
• Shared Addressing
• Unexpected Message Buffering
Providers
• Sockets, TCP, UDP
• Verbs
• Cisco usNIC
• Intel OPA, PSM
• Cray GNI
• AWS EFA
• IBM Blue Gene
• Shared Memory
• HPE Gen-Z
• Network Direct
• RxM, RxD, Multi-Rail, Hooks, …
Persistent Memory
SmartNICs and FPGAs Acceleration
Because of an OFI-provider gap, not all apps work with all providers
OFI insulates applications from fabric diversity
* Other names and brands may be claimed as the property of others
ARCHITECTURE
[Diagram: libfabric architecture; visible labels: Modes, Capabilities]
OBJECT-MODEL
OFI only defines semantic requirements
Example mappings:
• NIC
• Network
• Peer address table
• Listener
• Command queues
• RDMA buffers
• Manage multiple CQs and counters
• Share wait objects (fd’s)
ENDPOINT TYPES
Unconnected
Connected
ENDPOINT CONTEXTS
Default
• Tx/Rx command ‘queues’
• Tx/Rx completions may go to the same or different CQs
Scalable Endpoints
• Targets multi-thread access to hardware
Shared Contexts
• Share underlying command queues
ADDRESS VECTORS
Address Vector (Table)
  fi_addr   Fabric Address
  0         100:3:50
  1         100:3:51
  2         101:3:83
  3         102:3:64
  …         …
Addresses are referenced by an index
- No application storage required
- O(n) memory in provider
- Lookup required on transfers
Address Vector (Map)
  fi_addr
  100003050
  100003051
  101003083
  102003064
  …
OFI returns 64-bit value for address
- n x 8 bytes application memory
- No provider storage required
- Direct addressing possible
User Address (IP:Port)
  10.0.0.1:7000
  10.0.0.1:7001
  10.0.0.2:7000
  10.0.0.3:7003
  …
Converts portable addressing (e.g. hostname or sockaddr) to fabric specific address
Possible to share AV between processes
example mappings
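The trade-off between the two address vector styles can be sketched in plain C. This is a minimal model, not libfabric code: `struct fabric_addr`, `av_table_insert`, `av_table_lookup`, `av_map_insert`, and `av_map_decode` are hypothetical names, and the decimal packing used for the map handle is only an assumption chosen to reproduce the slide's example values (100:3:50 becomes 100003050).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fabric address, shaped like the slide's "100:3:50" triples. */
struct fabric_addr {
    uint32_t net;
    uint32_t port;
    uint32_t node;
};

#define AV_CAPACITY 16
#define AV_INVALID  UINT64_MAX

/* Table style: the "provider" owns the storage, the app keeps only an index. */
struct av_table {
    struct fabric_addr entries[AV_CAPACITY];
    size_t count;
};

/* Returns the index assigned to the address, or AV_INVALID when the table is full. */
uint64_t av_table_insert(struct av_table *av, struct fabric_addr addr)
{
    if (av->count == AV_CAPACITY)
        return AV_INVALID;
    av->entries[av->count] = addr;
    return (uint64_t)av->count++;
}

/* Table style pays for a lookup on each transfer. */
struct fabric_addr av_table_lookup(const struct av_table *av, uint64_t handle)
{
    assert(handle < av->count);
    return av->entries[handle];
}

/* Map style: the 64-bit handle itself encodes the address, so nothing is stored
 * on the provider side. The decimal packing is an assumption of this sketch. */
uint64_t av_map_insert(struct fabric_addr addr)
{
    assert(addr.port < 1000 && addr.node < 1000);
    return (uint64_t)addr.net * 1000000u + (uint64_t)addr.port * 1000u + addr.node;
}

/* Direct addressing: decode the handle without touching any table. */
struct fabric_addr av_map_decode(uint64_t handle)
{
    struct fabric_addr addr;
    addr.net  = (uint32_t)(handle / 1000000u);
    addr.port = (uint32_t)((handle / 1000u) % 1000u);
    addr.node = (uint32_t)(handle % 1000u);
    return addr;
}
```

With the table style the application stores small indices and the lookup cost sits in the transfer path; with the map style the application stores one 64-bit value per peer and the address is recovered directly from it.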
DATA TRANSFER TYPES
[Diagrams: message queues for MSG and TAGGED transfers]
MSG
• Maintain message boundaries, FIFO
TAGGED
• Messages carry user ‘tag’ or id
• Receiver selects which tag goes with each buffer
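Tag selection on the receive side can be sketched as a small matching routine. This is a stand-alone model under stated assumptions: `struct posted_recv`, `tag_matches`, and `match_recv` are hypothetical names, and the rule that bits set in an ignore mask are excluded from the comparison is modeled on the tag/ignore pair of libfabric's tagged receive call rather than taken from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One receive buffer posted by the application (hypothetical model). */
struct posted_recv {
    uint64_t tag;     /* tag the receiver wants for this buffer */
    uint64_t ignore;  /* bits set here are not compared */
    int in_use;       /* 1 while the buffer is still waiting for a message */
};

/* A message matches when all non-ignored tag bits are equal. */
int tag_matches(uint64_t msg_tag, uint64_t recv_tag, uint64_t ignore)
{
    return ((msg_tag ^ recv_tag) & ~ignore) == 0;
}

/* Scans posted receives in posting order and consumes the first match.
 * Returns its index, or -1 when no buffer matches (an "unexpected" message). */
int match_recv(struct posted_recv *recvs, size_t n, uint64_t msg_tag)
{
    assert(recvs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (recvs[i].in_use &&
            tag_matches(msg_tag, recvs[i].tag, recvs[i].ignore)) {
            recvs[i].in_use = 0;
            return (int)i;
        }
    }
    return -1;
}
```

Unlike plain MSG transfers, which fill buffers in FIFO order, a tagged message here can land in any posted buffer whose tag it matches.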
DATA TRANSFER TYPES
[Diagrams: byte stream and multicast datagrams]
MSG Stream
• Data sent and received as ‘stream’ (no message boundaries)
• Uses ‘MSG’ APIs but different endpoint capabilities
• Synchronous completion semantics (application always owns buffer)
Multicast MSG
• Send to or receive from multicast group
DATA TRANSFER TYPES
[Diagrams: remote writes for RMA; f(x,y) and g(x,y) operations applied at the target for ATOMIC]
RMA
• RDMA semantics
• Direct reads or writes of remote memory from user perspective
ATOMIC
• Specify operation to perform on selected datatype
• Format of data at target is known to fabric services
PROVIDER ARCHITECTURE
RXM PROVIDER
[Diagram: MPI / SHMEM over OFI RxM, which provides RDM endpoints on top of OFI MSG endpoints from Verbs RC, NetworkDirect, and TCP]
High-priority
Broadly used
Primary path for HPC apps accessing verbs hardware
Connection multiplexing
• Optimizes for hardware features
  - CQ data
  - Inline data
  - SRQ
• Dynamic connections
• Eager messages – small transfers
• Segmentation and reassembly – medium transfers
• Rendezvous – large transfers
• Memory registration cache
• Ideal: tighter provider coupling
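The size-based choice between eager, segmentation-and-reassembly, and rendezvous transfers can be sketched as follows. This is an illustrative model only: `choose_proto`, `sar_segments`, `struct proto_limits`, and any threshold values passed in are hypothetical and are not taken from the RxM source.

```c
#include <assert.h>
#include <stddef.h>

/* The three transfer protocols named on the slide. */
enum xfer_proto {
    XFER_EAGER,       /* small: payload travels with the first message */
    XFER_SAR,         /* medium: split into segments, reassembled at the target */
    XFER_RENDEZVOUS   /* large: handshake first, then bulk data movement */
};

/* Hypothetical size limits; a real provider would make these configurable. */
struct proto_limits {
    size_t eager_max;  /* largest message sent eagerly */
    size_t sar_max;    /* largest message sent as segments */
};

/* Picks a protocol purely from the message length. */
enum xfer_proto choose_proto(const struct proto_limits *lim, size_t len)
{
    assert(lim != NULL && lim->eager_max <= lim->sar_max);
    if (len <= lim->eager_max)
        return XFER_EAGER;
    if (len <= lim->sar_max)
        return XFER_SAR;
    return XFER_RENDEZVOUS;
}

/* Number of segments needed to carry len bytes (rounded up). */
size_t sar_segments(size_t len, size_t seg_size)
{
    assert(seg_size > 0);
    return (len + seg_size - 1) / seg_size;
}
```

For example, with limits of 16 KiB and 128 KiB (assumed values), a 64 KiB message would take the segmentation path and be carried in four 16 KiB segments.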
RXD PROVIDER
[Diagram: MPI / SHMEM over OFI RxD, which provides RDM endpoints on top of OFI DGRAM endpoints from Verbs UD, usNIC, Raw Ethernet, UDP, Other..?]
Functional
Developing / optimizing
Path for HPC scalability
Fast development path for hardware support
Reliability, segmentation, reassembly, flow control
Extend features of simple RDM provider
Offload large transfers
• Re-designing for performance and scalability
• Analyzing provider specific optimizations
SHARED MEMORY PROVIDER
SHM Provider
SMR (Shared Memory Region) fields: Version, Flags, PID, Region Size, Lock, Command Queue, Response Queue, Peer Address Map, Inject Buffers
• Shared memory primitives
• One-sided and two-sided transfers
• Single command queue
• CMA (cross-memory attach) for large transfers
• xpmem support under consideration
HOOKING PROVIDER
Available
[Diagram: User, OFI, Hook, OFI Core, Core/Util Provider]
• Intercept calls to any provider
• Zero-impact unless enabled
• Always available – release and debug builds
• Debugging, performance analysis, feature enhancements, testing
PERFORMANCE MONITORING
Performance Data Set
• Event Data: Count, Sum (one entry per event)
Performance Management Unit
• Performance ‘domains’: CPU, Cache, NIC, ?
• Events: Cycles, Instructions, Hits, Misses
Inline performance tracking
Linux RDPMC
Ex: Sample CPU instructions for various code paths
MULTI-RAIL PROVIDER
Under development
[Diagram: User, OFI mRail EP, RDM endpoints EP 1 and EP 2; labels: Active, Standby]
• Multiple EPs, ports, NICs, fabrics
• Increase bandwidth and message rate
• Failover
• Application or admin configured
• One fi_info structure per rail
• Rail selection ‘plug-in’
• Isolate rail selection algorithm
• Require variable message support
• TBD: recovery fallback
ACCELERATORS AND HETEROGENEOUS MEMORY
Analyzing
[Diagram: CPU, Memory, PMEM, (Smart) NIC, Peer Device (GPU) with Device Memory, FPGA with Device Memory]
• APIs assume memory mapped regions
• Memory regions may not be mapped
• May not want to write data through CPU caches
• Results may be cached by NIC for long transactions
• CPU load/stores
• Same coherency domain
• Programmable offload capabilities and flow processing
• May need to sync results with CPU
• Accelerations may be available over the fabric
GETTING STARTED
DOWNLOAD LIBFABRIC AND FABTESTS
libfabric.org – redirects to GitHub
Link to all releases
Link to validation tests (fabtests)
Man pages (scroll down)
INSTALLATION
$ tar xvf libfabric-1.7.0.tar.bz2
$ cd libfabric-1.7.0
$ ./configure
$ make
$ sudo make install
Follows common conventions
GitHub only has tar files
RPMs from distro or OFED
• By default will auto-detect which providers are usable
• Can enable/disable providers
  - e.g. --enable-verbs=yes
• --enable-debug
  - Development build
• --help
FI_INFO
libfabric utility program
$ fi_info                  Condensed view of usable providers
$ fi_info -l (--list)      List all installed providers
$ fi_info -v (--verbose)   Detailed attributes of usable providers
$ fi_info --env            List all environment variables (description and effect, default values)
FI_LOG_LEVEL=[debug | info | warn | …]
FI_PROVIDER – enable/disable providers
  FI_PROVIDER=[tcp,udp,…]
  FI_PROVIDER=^[udp,…]
FI_HOOK=[perf]
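The include/exclude behavior of the FI_PROVIDER filter can be sketched with a small string matcher. This is a stand-alone model, not libfabric's own parser: `provider_enabled` is a hypothetical helper, and it assumes a comma-separated list of provider names in which a leading `^` turns the list into an exclusion list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if provider `name` passes `filter`, 0 otherwise.
 * An empty or missing filter enables every provider (assumption of this sketch). */
int provider_enabled(const char *filter, const char *name)
{
    int negate = 0;
    int listed = 0;
    size_t name_len;

    assert(name != NULL);
    if (filter == NULL || *filter == '\0')
        return 1;

    /* A leading '^' means "everything except the listed providers". */
    if (*filter == '^') {
        negate = 1;
        filter++;
    }

    name_len = strlen(name);
    while (*filter != '\0') {
        /* Length of the current comma-separated entry. */
        size_t len = strcspn(filter, ",");
        if (len == name_len && strncmp(filter, name, len) == 0)
            listed = 1;
        filter += len;
        if (*filter == ',')
            filter++;
    }
    return negate ? !listed : listed;
}
```

Under these assumptions, a filter of `tcp,udp` keeps only those two providers, while `^udp` keeps everything except udp.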
API BOOTSTRAP
FI_GETINFO
struct fi_info *fi_allocinfo(void);
int fi_getinfo(
    uint32_t version,         /* API version */
    const char *node,         /* ~getaddrinfo */
    const char *service,      /* ~getaddrinfo */
    uint64_t flags,           /* ~getaddrinfo */
    struct fi_info *hints,    /* app needs */
    struct fi_info **info);
void fi_freeinfo(
    struct fi_info *info);
struct fi_info {
    struct fi_info *next;
    uint64_t caps;            /* API semantics needed ... */
    uint64_t mode;            /* ... and provider requirements for using them */
    uint32_t addr_format;
    size_t src_addrlen;
    size_t dest_addrlen;
    void *src_addr;
    void *dest_addr;
    fid_t handle;
    /* Detailed object attributes */
    struct fi_tx_attr *tx_attr;
    struct fi_rx_attr *rx_attr;
    struct fi_ep_attr *ep_attr;
    struct fi_domain_attr *domain_attr;
    struct fi_fabric_attr *fabric_attr;
};
CAPABILITY AND MODE BITS
Capabilities
• Desired services requested by app
• Primary – app must request to use
  - E.g. FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC
• Secondary – provider can indicate availability
  - E.g. FI_SOURCE, FI_MULTI_RECV
Modes
• Requirements placed on the app
• Improves performance when implemented by application
• App indicates which modes it supports
• Provider clears modes not needed
• Sample: FI_CONTEXT, FI_LOCAL_MR
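The capability and mode handshake can be sketched as bit-mask arithmetic. This is a minimal model under stated assumptions: the `CAP_*` and `MODE_*` constants use made-up bit positions (the real FI_* values are defined by libfabric's headers), and `struct provider_desc` and `negotiate` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions for this sketch only. */
#define CAP_MSG       (1ULL << 0)
#define CAP_RMA       (1ULL << 1)
#define CAP_TAGGED    (1ULL << 2)
#define MODE_CONTEXT  (1ULL << 0)
#define MODE_LOCAL_MR (1ULL << 1)

/* What a provider offers and what it demands from the application. */
struct provider_desc {
    uint64_t caps;            /* services the provider can supply */
    uint64_t required_modes;  /* requirements the app must honor */
};

/* Returns 1 and stores the resulting mode bits if the provider is usable,
 * or 0 if it lacks a requested capability or needs an unsupported mode. */
int negotiate(const struct provider_desc *prov, uint64_t wanted_caps,
              uint64_t app_modes, uint64_t *result_modes)
{
    assert(prov != NULL && result_modes != NULL);

    /* Every requested capability must be available. */
    if ((prov->caps & wanted_caps) != wanted_caps)
        return 0;

    /* The provider may not require a mode the app did not offer. */
    if ((prov->required_modes & ~app_modes) != 0)
        return 0;

    /* The provider clears the offered modes it does not need. */
    *result_modes = app_modes & prov->required_modes;
    return 1;
}
```

An application that offers more modes than a provider needs simply gets a smaller mode set back, which is the "provider clears modes not needed" step from the slide.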
ATTRIBUTES
struct fi_fabric_attr {
    struct fid_fabric *fabric;            /* Already opened resource (if available) */
    char *name;
    char *prov_name;                      /* Provider details; can also use env var to filter */
    uint32_t prov_version;
    uint32_t api_version;
};
struct fi_domain_attr {
    struct fid_domain *domain;            /* Already opened resource (if available) */
    char *name;
    enum fi_threading threading;          /* How resources are allocated among threads for lockless access */
    enum fi_progress control_progress;    /* Do app threads drive transfers */
    enum fi_progress data_progress;
    enum fi_resource_mgmt resource_mgmt;  /* Provider protects against queue overruns */
    enum fi_av_type av_type;
    int mr_mode;
    /* provider limits – fields omitted */
    ...
    uint64_t caps;
    uint64_t mode;
    uint8_t *auth_key;                    /* Secure communication (job key) */
    ...
};
ATTRIBUTES
struct fi_ep_attr {
    enum fi_ep_type type;
    uint32_t protocol;            /* Indicates interoperability */
    uint32_t protocol_version;
    size_t max_msg_size;
    size_t msg_prefix_size;
    size_t max_order_raw_size;    /* Order of data placement between two messages */
    size_t max_order_war_size;
    size_t max_order_waw_size;
    uint64_t mem_tag_format;
    size_t tx_ctx_cnt;            /* Default, shared, or scalable */
    size_t rx_ctx_cnt;
    size_t auth_key_size;
    uint8_t *auth_key;
};
ATTRIBUTES
struct fi_tx_attr {
    uint64_t caps;
    uint64_t mode;
    uint64_t op_flags;
    uint64_t msg_order;     /* Can messages be sent and received out of order */
    uint64_t comp_order;    /* Are completions reported in order */
    size_t inject_size;     /* “Fast” message size */
    size_t size;
    size_t iov_limit;
    size_t rma_iov_limit;
};
struct fi_rx_attr {
    uint64_t caps;
    uint64_t mode;
    uint64_t op_flags;
    uint64_t msg_order;     /* Can messages be sent and received out of order */
    uint64_t comp_order;    /* Are completions reported in order */
    size_t total_buffered_recv;
    size_t size;
    size_t iov_limit;
};
THANK YOU
Sean Hefty
OFIWG Co-Chair
• All things libfabric:
• www.libfabric.org
• Meetings:
• https://www.openfabrics.org/my-calendar/
• https://www.openfabrics.org/ofiwg-webex/
• Mailing list
• https://lists.openfabrics.org/mailman/listinfo/ofiwg
LEGAL DISCLAIMER & OPTIMIZATION NOTICE
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for
use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of
that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
• INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE,
TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
• Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are
trademarks of Intel Corporation in the U.S. and other countries.
MEMORY MONITOR AND REGISTRATION CACHE
Memory Monitor Core
• Monitor ‘plug-in’: driver notification, hook alloc/free, provider specific
• Notification queues: subscribe, events
A generic solution is desired here
Registration Cache
• MR Map
• LRU List
• Custom Limits
• Usage Stats
• Merges overlapping regions
• Tracks active usage
Provider
• Internal API: get/put MRs
• Callbacks to add/delete MRs
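The "merges overlapping regions" step of a registration cache can be sketched with simple interval arithmetic. This is an illustrative model only: `struct mr_region`, `regions_overlap`, and `regions_merge` are hypothetical names and do not reflect the cache's internal API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A registered memory range [start, start + len). */
struct mr_region {
    uintptr_t start;
    size_t len;
};

/* Two half-open ranges overlap when each starts before the other ends. */
int regions_overlap(struct mr_region a, struct mr_region b)
{
    return a.start < b.start + b.len && b.start < a.start + a.len;
}

/* Returns the smallest single region covering both inputs.
 * Only meaningful when the inputs overlap, which is asserted. */
struct mr_region regions_merge(struct mr_region a, struct mr_region b)
{
    struct mr_region out;
    uintptr_t a_end = a.start + a.len;
    uintptr_t b_end = b.start + b.len;
    uintptr_t end = a_end > b_end ? a_end : b_end;

    assert(regions_overlap(a, b));
    out.start = a.start < b.start ? a.start : b.start;
    out.len = (size_t)(end - out.start);
    return out;
}
```

Merging lets one cached registration serve later requests that fall anywhere inside the combined range, at the cost of registering somewhat more memory than any single request asked for.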
PERSISTENT MEMORY
[Diagram: User registers PMEM (PMEM MR); User issues RMA Write to Persistent Memory; Commit complete]
High-availability model (v1.6)
• New completion semantic
• Documentation limits use case
Evolve APIs to support other usage models
• Keep implementation agnostic
  - Handle offload and on-load models
  - Support multi-rail
  - Minimize state footprint
• Exploration
  - Byte addressable or object aware
  - Single or multi-transfer commit
  - Advanced operations (e.g. atomics)
Work with SNIA (Storage Networking Industry Association)

More Related Content

What's hot

SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol
SHARP: In-Network Scalable Hierarchical Aggregation and Reduction ProtocolSHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol
SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol
inside-BigData.com
 

What's hot (20)

Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
 
HAProxy
HAProxy HAProxy
HAProxy
 
Understanding DPDK algorithmics
Understanding DPDK algorithmicsUnderstanding DPDK algorithmics
Understanding DPDK algorithmics
 
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Spring Boot+Kafka: the New Enterprise Platform
Spring Boot+Kafka: the New Enterprise PlatformSpring Boot+Kafka: the New Enterprise Platform
Spring Boot+Kafka: the New Enterprise Platform
 
DIY Netflow Data Analytic with ELK Stack by CL Lee
DIY Netflow Data Analytic with ELK Stack by CL LeeDIY Netflow Data Analytic with ELK Stack by CL Lee
DIY Netflow Data Analytic with ELK Stack by CL Lee
 
Kubernetes networking: Introduction to overlay networks, communication models...
Kubernetes networking: Introduction to overlay networks, communication models...Kubernetes networking: Introduction to overlay networks, communication models...
Kubernetes networking: Introduction to overlay networks, communication models...
 
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFT
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
大規模サービスを支えるネットワークインフラの全貌
大規模サービスを支えるネットワークインフラの全貌大規模サービスを支えるネットワークインフラの全貌
大規模サービスを支えるネットワークインフラの全貌
 
Linux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and Monitoring
 
SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol
SHARP: In-Network Scalable Hierarchical Aggregation and Reduction ProtocolSHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol
SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol
 
Implementation & Comparison Of Rdma Over Ethernet
Implementation & Comparison Of Rdma Over EthernetImplementation & Comparison Of Rdma Over Ethernet
Implementation & Comparison Of Rdma Over Ethernet
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache Kafka
 

Similar to OFI Overview 2019 Webinar

Similar to OFI Overview 2019 Webinar (20)

Intel the-latest-on-ofi
Intel the-latest-on-ofiIntel the-latest-on-ofi
Intel the-latest-on-ofi
 
Intel the-latest-on-ofi
Intel the-latest-on-ofiIntel the-latest-on-ofi
Intel the-latest-on-ofi
 
Learn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFVLearn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFV
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
 
Up and Running with gRPC & Cloud Career [GDG-Cloud-Dhaka-IO/2022}
Up and Running with gRPC & Cloud Career [GDG-Cloud-Dhaka-IO/2022}Up and Running with gRPC & Cloud Career [GDG-Cloud-Dhaka-IO/2022}
Up and Running with gRPC & Cloud Career [GDG-Cloud-Dhaka-IO/2022}
 
Advancing OpenFabrics Interfaces
Advancing OpenFabrics InterfacesAdvancing OpenFabrics Interfaces
Advancing OpenFabrics Interfaces
 
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
 
OpenFabrics Interfaces introduction
OpenFabrics Interfaces introductionOpenFabrics Interfaces introduction
OpenFabrics Interfaces introduction
 
509 512
509 512509 512
509 512
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
 
Amazon Elastic Fabric Adapter: Anatomy, Capabilities, and the Road Ahead
Amazon Elastic Fabric Adapter: Anatomy, Capabilities, and the Road AheadAmazon Elastic Fabric Adapter: Anatomy, Capabilities, and the Road Ahead
Amazon Elastic Fabric Adapter: Anatomy, Capabilities, and the Road Ahead
 
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
HiPEAC Computing Systems Week 2022_Mario Porrmann presentationHiPEAC Computing Systems Week 2022_Mario Porrmann presentation
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
 
What's New in RHEL 6 for Linux on System z?
What's New in RHEL 6 for Linux on System z?What's New in RHEL 6 for Linux on System z?
What's New in RHEL 6 for Linux on System z?
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
2008-09-09 IBM Interaction Conference, Red Hat Update for System z
2008-09-09 IBM Interaction Conference, Red Hat Update for System z2008-09-09 IBM Interaction Conference, Red Hat Update for System z
2008-09-09 IBM Interaction Conference, Red Hat Update for System z
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin  Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin
 
G rpc talk with intel (3)
G rpc talk with intel (3)G rpc talk with intel (3)
G rpc talk with intel (3)
 
Building Killer RESTful APIs with NodeJs
Building Killer RESTful APIs with NodeJsBuilding Killer RESTful APIs with NodeJs
Building Killer RESTful APIs with NodeJs
 

Recently uploaded

AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
Alluxio, Inc.
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
Max Lee
 

Recently uploaded (20)

KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
Malaysia E-Invoice digital signature docpptx
Malaysia E-Invoice digital signature docpptxMalaysia E-Invoice digital signature docpptx
Malaysia E-Invoice digital signature docpptx
 
AI Hackathon.pptx
AI                        Hackathon.pptxAI                        Hackathon.pptx
AI Hackathon.pptx
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
Naer Toolbar Redesign - Usability Research Synthesis
Naer Toolbar Redesign - Usability Research SynthesisNaer Toolbar Redesign - Usability Research Synthesis
Naer Toolbar Redesign - Usability Research Synthesis
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
Lessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfLessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdf
 
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAOpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
 
Microsoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMicrosoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdf
 
The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion Production
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 

OFI Overview 2019 Webinar

  • 1. OPENFABRICS INTERFACES LIBFABRIC Sean Hefty, OFIWG Co-Chair January, 2019 Intel Corporation libfabric.org
  • 2. AGENDA – MOVING FORWARD WITH LIBFABRIC 2 • Charter Software Strategy • Object Model Architecture • Utility Providers Provider Support • Samples Getting Started • Calling fi_getinfo() API Bootstrap Design motivations, architecture, usage guidance Developer Guide libfabric.org
  • 3. DESIGN GUIDELINES 3 Open Fabrics Interface Application Requirements Deployment Requirements Fabric Vendor Requirements Charter: develop interfaces aligned with user needs • Protect software investment • Enable new features without application changes • Find the ‘right’ abstraction • Ensure high-performance • Analyze impact of changes under a variety of assumptions • Build in extensibility OFI = a portable low-level network interface
  • 4. Optimized SW path to HW •Minimize cache and memory footprint •Reduce instruction count •Minimize memory accesses Scalable Implementation Agnostic Software interfaces aligned with user requirements •Careful requirement analysis Inclusive development effort •App and HW developers Good impedance match with multiple fabric hardware •InfiniBand, iWarp, RoCE, raw Ethernet, UDP offload, Omni-Path, GNI, BGQ, … Open Source User-Centric libfabric User-centric interfaces will help foster fabric innovation and accelerate their adoption 4
  • 5. MPI CASE STUDY  MPICH MPI implementation  CH3 – internal portable network API • Lack of clear network abstractions led to poor performance related choices  CH4 – revised internal layering • Driven based on work done for OFI • 6x+ reduction in software overhead! 5 Moral: in the absence of a good abstraction, each application will invent their own, and may result in lower performance Image source: K. Raffenetti et al. Why is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1. SC ’17 MPI_Put: 1342->44 MPI_Isend: 253->59
  • 6. OFI USER REQUIREMENTS 6 Give us a high- level interface! Give us a low- level interface! MPI developers OFI strives to meet both requirements Middleware is primary user Looking to expand beyond HPC GURUEASY
  • 7. OFI SOFTWARE DEVELOPMENT STRATEGIES One Size Does Not Fit All 7 Fabric Services User OFI Provider User OFI Provider Provider optimizes for OFI features Common optimization for all apps/providers Client uses OFI features User OFI Provider Client optimizes based on supported features Provider supports low-level features only Linux, FreeBSD, OS X, Windows TCP and UDP development support GURUEASY
  • 8. libfabric OFI LIBFABRIC COMMUNITY 8 Because of an OFI-provider gap, not all apps work with all providers Intel® MPI Library MPICH Open MPI SHMEM Sandia SHMEM GASNet Clang UPC libfabric Enabled Middleware Sockets TCP, UDP Verbs Cisco usNIC Intel OPA, PSM Cray GNI AWS EFA IBM Blue Gene Shared Memory * Other names and brands may be claimed as the property of others RxM, RxD, Multi- Rail, Hooks, … PMDK, Spark, ZeroMQ, TensorFlow, MxNET, NetIO, Intel MLSL, rsockets … Charm++ Chapel HPE Gen-Z Advanced application oriented semantics Tag Matching Scalable memory registration Triggered Operations Multi-Receive buffers Reliable Datagram Endpoints Remote Completion Semantics Streaming Endpoints Shared Addressing Unexpected Message Buffering Network Direct OFI insulates applications from fabric diversity Providers Persistent Memory SmartNICs and FPGAs Acceleration
  • 11. OBJECT-MODEL 11 OFI only defines semantic requirements NIC Network Peer address table Listener Command queues RDMA buffers Manage multiple CQs and counters Share wait objects (fd’s) example mappings
  • 13. ENDPOINT CONTEXTS
      - Default: Tx/Rx command ‘queues’. Tx/Rx completions may go to the same or different CQs.
      - Shared contexts: share underlying command queues.
      - Scalable endpoints: targets multi-thread access to hardware.
  • 14. ADDRESS VECTORS
      Converts portable addressing (e.g. hostname or sockaddr) to fabric specific address. Possible to share AV between processes.
      Example mappings:
      User address (IP:port)   Table: fi_addr -> fabric address   Map: fi_addr
      10.0.0.1:7000            0 -> 100:3:50                      100003050
      10.0.0.1:7001            1 -> 100:3:51                      100003051
      10.0.0.2:7000            2 -> 101:3:83                      101003083
      10.0.0.3:7003            3 -> 102:3:64                      102003064
      …                        …                                  …
      Address Vector (Table): addresses are referenced by an index
      - No application storage required
      - O(n) memory in provider
      - Lookup required on transfers
      Address Vector (Map): OFI returns 64-bit value for address
      - n x 8 bytes application memory
      - No provider storage required
      - Direct addressing possible
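The trade-off between the two address vector flavors on slide 14 can be sketched with a toy model in plain C. This is not the libfabric API: every `demo_*` name is hypothetical, and the decimal packing used for the map flavor is only an assumption chosen so that the slide's example `100:3:50` becomes `100003050`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fabric address, modelled on the slide's "100:3:50" example. */
struct demo_fabric_addr {
    uint32_t a, b, c;
};

#define DEMO_AV_CAPACITY 16

/* Table flavor: the provider stores every address (O(n) memory) and the
 * application refers to a peer by its index. */
struct demo_av_table {
    struct demo_fabric_addr entries[DEMO_AV_CAPACITY];
    size_t count;
};

/* Insert an address and return its index, which is the application's handle. */
static uint64_t demo_av_table_insert(struct demo_av_table *av,
                                     struct demo_fabric_addr addr)
{
    assert(av->count < DEMO_AV_CAPACITY);
    av->entries[av->count] = addr;
    return av->count++;
}

/* Each transfer has to resolve the index back to an address. */
static struct demo_fabric_addr
demo_av_table_lookup(const struct demo_av_table *av, uint64_t index)
{
    assert(index < av->count);
    return av->entries[index];
}

/* Map flavor: the 64-bit handle encodes the address itself, so nothing is
 * stored on the provider side. Assumes b and c are below 1000. */
static uint64_t demo_av_map_insert(struct demo_fabric_addr addr)
{
    return (uint64_t)addr.a * 1000000u + (uint64_t)addr.b * 1000u + addr.c;
}

/* Direct addressing: decode the handle without any table lookup. */
static struct demo_fabric_addr demo_av_map_decode(uint64_t handle)
{
    struct demo_fabric_addr addr;
    addr.a = (uint32_t)(handle / 1000000u);
    addr.b = (uint32_t)((handle / 1000u) % 1000u);
    addr.c = (uint32_t)(handle % 1000u);
    return addr;
}
```

With the table flavor the application keeps nothing per peer beyond a small integer; with the map flavor it keeps one 64-bit value per peer, and the transfer path skips the lookup.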
  • 15. DATA TRANSFER TYPES
      MSG: maintain message boundaries, FIFO.
      TAGGED: messages carry user ‘tag’ or id. Receiver selects which tag goes with each buffer.
      [Figure: msg 1 to msg 3 and tag 1 to tag 3 moving from sender to receiver]
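How a receiver "selects which tag goes with each buffer" on slide 15 can be illustrated with a small matcher. This is a sketch in plain C, not libfabric code: the `demo_*` names are hypothetical, and it assumes the common convention that a posted receive carries a tag plus an ignore mask whose set bits are excluded from the comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A posted receive buffer: it accepts a message whose tag equals 'tag'
 * on every bit that is NOT set in 'ignore'. */
struct demo_posted_recv {
    uint64_t tag;
    uint64_t ignore;
    int in_use;      /* 1 while the buffer is still waiting for a message */
};

static int demo_tag_matches(const struct demo_posted_recv *rx, uint64_t msg_tag)
{
    return ((rx->tag ^ msg_tag) & ~rx->ignore) == 0;
}

/* Find the oldest posted receive that matches msg_tag, mark it consumed,
 * and return its index. Returns -1 when nothing matches, i.e. the message
 * is "unexpected" and would have to be buffered. */
static int demo_match_and_consume(struct demo_posted_recv *queue, size_t n,
                                  uint64_t msg_tag)
{
    assert(queue != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (queue[i].in_use && demo_tag_matches(&queue[i], msg_tag)) {
            queue[i].in_use = 0;
            return (int)i;
        }
    }
    return -1;
}
```

A receive posted with tag `0x20` and ignore mask `0xF` accepts any tag from `0x20` to `0x2F`, while an ignore mask of zero demands an exact match.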
  • 16. DATA TRANSFER TYPES
      MSG Stream: data sent and received as ‘stream’ (no message boundaries). Synchronous completion semantics (application always owns buffer).
      Multicast MSG: send to or receive from multicast group.
      Uses ‘MSG’ APIs but different endpoint capabilities.
      [Figure: data 1 to data 3 as a stream; dgrams delivered to a multicast group]
  • 17. DATA TRANSFER TYPES
      RMA: RDMA semantics. Direct reads or writes of remote memory from user perspective.
      ATOMIC: specify operation to perform on selected datatype. Format of data at target is known to fabric services.
      [Figure: RMA writes 1 to 3, and atomic operations f(x,y) and g(x,y) applied at the target]
  • 19. RXM PROVIDER
      Status: high-priority, broadly used.
      Primary path for HPC apps accessing verbs hardware.
      [Diagram: MPI / SHMEM over OFI; RxM layers RDM endpoints over MSG endpoints of Verbs RC, NetworkDirect, TCP]
      - Connection multiplexing
      - Dynamic connections
      - Eager messages: small transfers
      - Segmentation and reassembly: medium transfers
      - Rendezvous: large transfers
      - Memory registration cache
      - Optimizes for hardware features: CQ data, inline data, SRQ
      - Ideal: tighter provider coupling
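Slide 19 names three transfer protocols by message size: eager for small, segmentation and reassembly (SAR) for medium, rendezvous for large. A minimal sketch of that selection in plain C follows. The thresholds and all `demo_*` names are assumptions for illustration only; the slide gives no sizes, and real limits depend on the provider's configuration.

```c
#include <assert.h>
#include <stddef.h>

enum demo_protocol { DEMO_EAGER, DEMO_SAR, DEMO_RENDEZVOUS };

/* Assumed thresholds, for illustration only. */
#define DEMO_EAGER_LIMIT ((size_t)16 * 1024)
#define DEMO_SAR_LIMIT   ((size_t)128 * 1024)

/* Pick a protocol from the message length alone. */
static enum demo_protocol demo_select_protocol(size_t len)
{
    if (len <= DEMO_EAGER_LIMIT)
        return DEMO_EAGER;        /* payload travels with the first message */
    if (len <= DEMO_SAR_LIMIT)
        return DEMO_SAR;          /* split into eager-sized segments */
    return DEMO_RENDEZVOUS;       /* handshake first, then move the data */
}

/* Number of eager-sized segments needed to carry len bytes (round up). */
static size_t demo_segment_count(size_t len)
{
    assert(len > 0);
    return (len + DEMO_EAGER_LIMIT - 1) / DEMO_EAGER_LIMIT;
}
```

Under these assumed limits, a 40 KiB message would be sent with SAR as three segments, while a 1 MiB message would use rendezvous.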
  • 20. RXD PROVIDER
      Status: functional; developing / optimizing.
      [Diagram: MPI / SHMEM over OFI; RxD layers RDM endpoints over DGRAM endpoints of Verbs UD, usNIC, Raw Ethernet, UDP, Other..?]
      - Reliability, segmentation, reassembly, flow control
      - Path for HPC scalability
      - Fast development path for hardware support
      - Extend features of simple RDM provider
      - Offload large transfers
      - Re-designing for performance and scalability
      - Analyzing provider specific optimizations
  • 21. SHARED MEMORY PROVIDER
      Status: available.
      - Shared memory primitives
      - One-sided and two-sided transfers
      - CMA (cross-memory attach) for large transfers
      - xpmem support under consideration
      - Single command queue
      SMR (Shared Memory Region) fields: Version, Flags, PID, Region Size, Lock, Command Queue, Response Queue, Peer Address Map, Inject Buffers
  • 22. HOOKING PROVIDER
      [Diagram: User, OFI Core, Hook, OFI, Core/Util Provider]
      - Intercept calls to any provider
      - Always available: release and debug builds
      - Zero-impact unless enabled
      - Debugging, performance analysis, feature enhancements, testing
  • 23. PERFORMANCE MONITORING
      Inline performance tracking (Linux RDPMC).
      Ex: sample CPU instructions for various code paths.
      [Diagram: a Performance Data Set (Event, Data, Count, Sum) fed by Performance Management Units for CPU, Cache, and NIC; example events: Cycles, Instructions, Hits, Misses. Open question: performance ‘domains’?]
  • 24. MULTI-RAIL PROVIDER
      Status: under development.
      [Diagram: User over OFI; an mRail RDM endpoint spans EP 1 and EP 2; rails marked Active or Standby]
      - Multiple EPs, ports, NICs, fabrics
      - Application or admin configured
      - Increase bandwidth and message rate
      - Failover
      - Rail selection ‘plug-in’: isolate rail selection algorithm
      - One fi_info structure per rail
      - Require variable message support
      - TBD: recovery fallback
  • 25. ACCELERATORS AND HETEROGENEOUS MEMORY
      Status: analyzing.
      [Diagram: CPU, Memory, PMEM, (Smart) NIC, Peer Device (GPU), FPGA, Device Memory]
      - APIs assume memory mapped regions
      - May not want to write data through CPU caches
      - Memory regions may not be mapped
      - Results may be cached by NIC for long transactions
      - CPU load/stores; same coherency domain
      - Programmable offload capabilities and flow processing
      - May need to sync results with CPU
      - Accelerations may be available over the fabric
  • 27. DOWNLOAD LIBFABRIC AND FABTESTS
      libfabric.org redirects to GitHub:
      - Link to all releases
      - Link to validation tests (fabtests)
      - Man pages (scroll down)
  • 28. INSTALLATION
      $ tar xvf libfabric-1.7.0.tar.bz2
      $ cd libfabric-1.7.0
      $ ./configure
      $ make
      $ sudo make install
      Follows common conventions. GitHub only has tar files; RPMs from distro or OFED.
      configure options:
      - By default will auto-detect which providers are usable
      - Can enable/disable providers, e.g. --enable-verbs=yes
      - --enable-debug: development build
      - --help
  • 29. FI_INFO
      libfabric utility program:
      $ fi_info                  Condensed view of usable providers
      $ fi_info -l (--list)      List all installed providers
      $ fi_info -v (--verbose)   Detailed attributes of usable providers
      $ fi_info --env            List all environment variables: description and effect, default values
      Environment variables:
      - FI_LOG_LEVEL=[debug | info | warn | …]
      - FI_PROVIDER: enable/disable providers, e.g. FI_PROVIDER=[tcp,udp,…] or FI_PROVIDER=^[udp,…]
      - FI_HOOK=[perf]
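The FI_PROVIDER syntax on slide 29 reads as an include list (`tcp,udp`) or, with a leading `^`, an exclude list (`^udp`). The sketch below models that reading in plain C. It is an illustration under that assumption, not libfabric's own parser, and the `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if 'name' appears as a whole item in the comma-separated 'list'. */
static int demo_list_contains(const char *list, const char *name)
{
    size_t name_len = strlen(name);
    const char *item = list;

    while (*item != '\0') {
        size_t item_len = strcspn(item, ",");   /* length up to next comma */
        if (item_len == name_len && strncmp(item, name, name_len) == 0)
            return 1;
        item += item_len;
        if (*item == ',')
            item++;                             /* skip the separator */
    }
    return 0;
}

/* Decide whether provider 'name' survives the filter string.
 * NULL or empty filter: no filtering, every provider stays enabled.
 * Leading '^': the listed providers are disabled, all others enabled.
 * Otherwise: only the listed providers are enabled. */
static int demo_provider_enabled(const char *filter, const char *name)
{
    assert(name != NULL);
    if (filter == NULL || *filter == '\0')
        return 1;
    if (*filter == '^')
        return !demo_list_contains(filter + 1, name);
    return demo_list_contains(filter, name);
}
```

In an application the filter string would typically come from `getenv("FI_PROVIDER")`; here it is passed in explicitly so the logic is easy to exercise.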
  • 31. FI_GETINFO
      struct fi_info *fi_allocinfo(void);
      int fi_getinfo(
              uint32_t version,          /* API version */
              const char *node,          /* ~getaddrinfo */
              const char *service,
              uint64_t flags,
              struct fi_info *hints,     /* app needs */
              struct fi_info **info);
      void fi_freeinfo(struct fi_info *info);
      struct fi_info {
              struct fi_info *next;
              uint64_t caps;             /* API semantics needed, and provider */
              uint64_t mode;             /* requirements for using them */
              uint32_t addr_format;
              size_t src_addrlen;
              size_t dest_addrlen;
              void *src_addr;
              void *dest_addr;
              fid_t handle;
              struct fi_tx_attr *tx_attr;          /* detailed object attributes */
              struct fi_rx_attr *rx_attr;
              struct fi_ep_attr *ep_attr;
              struct fi_domain_attr *domain_attr;
              struct fi_fabric_attr *fabric_attr;
      };
  • 32. CAPABILITY AND MODE BITS
      Capabilities: desired services requested by app
      - Primary: app must request to use, e.g. FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC
      - Secondary: provider can indicate availability, e.g. FI_SOURCE, FI_MULTI_RECV
      Modes: requirements placed on the app
      - Improves performance when implemented by application
      - App indicates which modes it supports
      - Provider clears modes not needed
      - Sample: FI_CONTEXT, FI_LOCAL_MR
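The negotiation described on slide 32 comes down to two bitmask checks, sketched below in plain C. It is a simplified model, not libfabric code: the `DEMO_*` bit values and the `demo_negotiate` helper are hypothetical (the real FI_* constants come from the libfabric headers), and the treatment of secondary capabilities is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values for illustration only. */
#define DEMO_CAP_MSG       (1ULL << 0)
#define DEMO_CAP_RMA       (1ULL << 1)
#define DEMO_CAP_TAGGED    (1ULL << 2)
#define DEMO_MODE_CONTEXT  (1ULL << 0)
#define DEMO_MODE_LOCAL_MR (1ULL << 1)

struct demo_provider {
    uint64_t caps;           /* services the provider can offer */
    uint64_t required_mode;  /* requirements it places on the app */
};

/* Returns 1 and fills out_caps/out_mode when the provider can serve the
 * application's hints, 0 otherwise. */
static int demo_negotiate(const struct demo_provider *prov,
                          uint64_t hint_caps, uint64_t hint_mode,
                          uint64_t *out_caps, uint64_t *out_mode)
{
    assert(prov != 0 && out_caps != 0 && out_mode != 0);

    /* Every capability the app asked for must be available. */
    if ((hint_caps & prov->caps) != hint_caps)
        return 0;

    /* The app must support every mode the provider requires. */
    if ((prov->required_mode & hint_mode) != prov->required_mode)
        return 0;

    *out_caps = hint_caps;
    /* Mode bits the provider does not need are cleared in the result. */
    *out_mode = prov->required_mode;
    return 1;
}
```

An app that advertises both sample modes to a provider requiring only the context mode gets back a result with the unneeded bit cleared, so it knows which requirement it actually has to honor.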
  • 33. ATTRIBUTES
      struct fi_fabric_attr {
              struct fid_fabric *fabric;   /* already opened resource (if available) */
              char *name;
              char *prov_name;             /* provider details; can also use env var to filter */
              uint32_t prov_version;
              uint32_t api_version;
      };
      struct fi_domain_attr {
              struct fid_domain *domain;   /* already opened resource (if available) */
              char *name;
              enum fi_threading threading; /* how resources are allocated among threads for lockless access */
              enum fi_progress control_progress;   /* do app threads drive transfers */
              enum fi_progress data_progress;
              enum fi_resource_mgmt resource_mgmt; /* provider protects against queue overruns */
              enum fi_av_type av_type;
              int mr_mode;
              /* provider limits – fields omitted */
              ...
              uint64_t caps;
              uint64_t mode;
              uint8_t *auth_key;           /* secure communication (job key) */
              ...
      };
  • 34. ATTRIBUTES
      struct fi_ep_attr {
              enum fi_ep_type type;
              uint32_t protocol;           /* indicates interoperability */
              uint32_t protocol_version;
              size_t max_msg_size;
              size_t msg_prefix_size;
              size_t max_order_raw_size;   /* order of data placement between two messages */
              size_t max_order_war_size;
              size_t max_order_waw_size;
              uint64_t mem_tag_format;
              size_t tx_ctx_cnt;           /* default, shared, or scalable */
              size_t rx_ctx_cnt;
              size_t auth_key_size;
              uint8_t *auth_key;
      };
  • 35. ATTRIBUTES
      struct fi_tx_attr {
              uint64_t caps;
              uint64_t mode;
              uint64_t op_flags;
              uint64_t msg_order;    /* can messages be sent and received out of order */
              uint64_t comp_order;   /* are completions reported in order */
              size_t inject_size;    /* “fast” message size */
              size_t size;
              size_t iov_limit;
              size_t rma_iov_limit;
      };
      struct fi_rx_attr {
              uint64_t caps;
              uint64_t mode;
              uint64_t op_flags;
              uint64_t msg_order;
              uint64_t comp_order;
              size_t total_buffered_recv;
              size_t size;
              size_t iov_limit;
      };
  • 36. THANK YOU Sean Hefty OFIWG Co-Chair • All things libfabric: • www.libfabric.org • Meetings: • https://www.openfabrics.org/my-calendar/ • https://www.openfabrics.org/ofiwg-webex/ • Mailing list • https://lists.openfabrics.org/mailman/listinfo/ofiwg
  • 37. LEGAL DISCLAIMER & OPTIMIZATION NOTICE
      Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
      Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
      INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
      Copyright © 2018, Intel Corporation. All rights reserved.
Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
  • 38. MEMORY MONITOR AND REGISTRATION CACHE
      Memory Monitor:
      - Core plus monitor ‘plug-in’: driver notification, hook alloc/free, provider specific
      - Providers subscribe; events are delivered through notification queues
      - A generic solution is desired here
      Registration Cache:
      - MR map; merges overlapping regions
      - LRU list; tracks active usage
      - Custom limits, usage stats
      - Internal API: get/put MRs; callbacks to add/delete MRs
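Slide 38 notes that the registration cache merges overlapping regions. A minimal sketch of that one step in plain C is shown here, under the assumption that a region is a half-open byte range `[start, start + len)`. The `demo_*` names are hypothetical and the cache's real data structures are not shown on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A registered memory range, half-open: [start, start + len). */
struct demo_region {
    uintptr_t start;
    size_t len;
};

/* Two half-open ranges overlap when each begins before the other ends. */
static int demo_regions_overlap(struct demo_region a, struct demo_region b)
{
    return a.start < b.start + b.len && b.start < a.start + a.len;
}

/* Smallest region covering both inputs. Only meaningful for overlapping
 * regions, which the assertion enforces. */
static struct demo_region demo_regions_merge(struct demo_region a,
                                             struct demo_region b)
{
    uintptr_t start = a.start < b.start ? a.start : b.start;
    uintptr_t end_a = a.start + a.len;
    uintptr_t end_b = b.start + b.len;
    uintptr_t end = end_a > end_b ? end_a : end_b;
    struct demo_region merged;

    assert(demo_regions_overlap(a, b));
    merged.start = start;
    merged.len = (size_t)(end - start);
    return merged;
}
```

Merging lets one cached registration serve later requests for either original range, at the cost of pinning a slightly larger span.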
  • 39. PERSISTENT MEMORY
      High-availability model (v1.6):
      - User registers PMEM (PMEM MR)
      - RMA write to persistent memory
      - New completion semantic: commit complete
      - Documentation limits use case
      Evolve APIs to support other usage models. Work with SNIA (Storage Networking Industry Association).
      Keep implementation agnostic:
      - Handle offload and on-load models
      - Support multi-rail
      - Minimize state footprint
      Exploration:
      - Byte addressable or object aware
      - Single or multi-transfer commit
      - Advanced operations (e.g. atomics)