More Related Content

Recently uploaded(20)

OFI Overview 2019 Webinar

  1. OPENFABRICS INTERFACES LIBFABRIC Sean Hefty, OFIWG Co-Chair January, 2019 Intel Corporation
  2. AGENDA – MOVING FORWARD WITH LIBFABRIC 2 • Charter Software Strategy • Object Model Architecture • Utility Providers Provider Support • Samples Getting Started • Calling fi_getinfo() API Bootstrap Design motivations, architecture, usage guidance Developer Guide
  3. DESIGN GUIDELINES 3 Open Fabrics Interface Application Requirements Deployment Requirements Fabric Vendor Requirements Charter: develop interfaces aligned with user needs • Protect software investment • Enable new features without application changes • Find the ‘right’ abstraction • Ensure high-performance • Analyze impact of changes under a variety of assumptions • Build in extensibility OFI = a portable low-level network interface
  4. Optimized SW path to HW •Minimize cache and memory footprint •Reduce instruction count •Minimize memory accesses Scalable Implementation Agnostic Software interfaces aligned with user requirements •Careful requirement analysis Inclusive development effort •App and HW developers Good impedance match with multiple fabric hardware •InfiniBand, iWarp, RoCE, raw Ethernet, UDP offload, Omni-Path, GNI, BGQ, … Open Source User-Centric libfabric User-centric interfaces will help foster fabric innovation and accelerate their adoption 4
  5. MPI CASE STUDY  MPICH MPI implementation  CH3 – internal portable network API • Lack of clear network abstractions led to poor performance related choices  CH4 – revised internal layering • Driven based on work done for OFI • 6x+ reduction in software overhead! 5 Moral: in the absence of a good abstraction, each application will invent their own, and may result in lower performance Image source: K. Raffenetti et al. Why is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1. SC ’17 MPI_Put: 1342->44 MPI_Isend: 253->59
  6. OFI USER REQUIREMENTS 6 Give us a high- level interface! Give us a low- level interface! MPI developers OFI strives to meet both requirements Middleware is primary user Looking to expand beyond HPC GURUEASY
  7. OFI SOFTWARE DEVELOPMENT STRATEGIES One Size Does Not Fit All 7 Fabric Services User OFI Provider User OFI Provider Provider optimizes for OFI features Common optimization for all apps/providers Client uses OFI features User OFI Provider Client optimizes based on supported features Provider supports low-level features only Linux, FreeBSD, OS X, Windows TCP and UDP development support GURUEASY
  8. libfabric OFI LIBFABRIC COMMUNITY 8 Because of an OFI-provider gap, not all apps work with all providers Intel® MPI Library MPICH Open MPI SHMEM Sandia SHMEM GASNet Clang UPC libfabric Enabled Middleware Sockets TCP, UDP Verbs Cisco usNIC Intel OPA, PSM Cray GNI AWS EFA IBM Blue Gene Shared Memory * Other names and brands may be claimed as the property of others RxM, RxD, Multi- Rail, Hooks, … PMDK, Spark, ZeroMQ, TensorFlow, MxNET, NetIO, Intel MLSL, rsockets … Charm++ Chapel HPE Gen-Z Advanced application oriented semantics Tag Matching Scalable memory registration Triggered Operations Multi-Receive buffers Reliable Datagram Endpoints Remote Completion Semantics Streaming Endpoints Shared Addressing Unexpected Message Buffering Network Direct OFI insulates applications from fabric diversity Providers Persistent Memory SmartNICs and FPGAs Acceleration
  10. ARCHITECTURE 10 Modes Capabilities
  11. OBJECT-MODEL 11 OFI only defines semantic requirements NIC Network Peer address table Listener Command queues RDMA buffers Manage multiple CQs and counters Share wait objects (fd’s) example mappings
  12. ENDPOINT TYPES 12 Unconnected Connected
  13. ENDPOINT CONTEXTS 13 Default Scalable Endpoints Shared Contexts Tx/Rx completions may go to the same or different CQs Tx/Rx command ‘queues’ Share underlying command queues Targets multi-thread access to hardware
  14. ADDRESS VECTORS 14 Address Vector (Table) fi_addr Fabric Address 0 100:3:50 1 100:3:51 2 101:3:83 3 102:3:64 … … Address Vector (Map) fi_addr 100003050 100003051 101003083 102003064 … Addresses are referenced by an index - No application storage required - O(n) memory in provider - Lookup required on transfers OFI returns 64-bit value for address - n x 8 bytes application memory - No provider storage required - Direct addressing possible Converts portable addressing (e.g. hostname or sockaddr) to fabric specific address Possible to share AV between processes User Address IP:Port … example mappings
  15. DATA TRANSFER TYPES 15 msg 2 msg 1msg 3 msg 2 msg 1msg 3 tag 2 tag 1tag 3 tag 2 tag 1 tag 3 TAGGED MSG Maintain message boundaries, FIFO Messages carry user ‘tag’ or id Receiver selects which tag goes with each buffer
  16. DATA TRANSFER TYPES 16 data 2 data 1data 3 data 1 data 2 data 3 MSG Stream dgram dgram dgram dgram dgram Multicast MSG Send to or receive from multicast group Data sent and received as ‘stream’ (no message boundaries) Uses ‘MSG’ APIs but different endpoint capabilities Synchronous completion semantics (application always owns buffer)
  17. 17 DATA TRANSFER TYPES write 2 write 1write 3 write 1 write 3 write 2 RMA RDMA semantics Direct reads or writes of remote memory from user perspective Specify operation to perform on selected datatype f(x,y) ATOMIC f(x,y) f(x,y) f(x,y) f() y g() y x g(x,y) g(x,y) g(x,y) x x Format of data at target is known to fabric services
  19. RXM PROVIDER MPI / SHMEM RxM RDM MSG MSG MSG MSG MSG RDM MSG RDM MSG RDM MSG RDM Verbs RC NetworkDirect TCP OFI OFI High-priority Broadly used Primary path for HPC apps accessing verbs hardware Optimizes for hardware features • CQ data • Inline data • SRQ  Dynamic connections  Eager messages – small transfers  Segmentation and reassembly – medium transfers  Rendezvous – large transfers  Memory registration cache  Ideal: tighter provider coupling Connection multiplexing 19
  20. MPI / SHMEM RxD RDM DGRAM DGRAM DGRAM RDM Verbs UD usNIC Raw Ethernet OFI OFI Functional Developing / optimizing Path for HPC scalability  Re-designing for performance and scalability  Analyzing provider specific optimizationsReliability, segmentation, reassembly, flow control UDP Other..? Fast development path for hardware support DGRAM RDM DGRAM RDM DGRAM RDM Extend features of simple RDM provider Offload large transfers 20 RXD PROVIDER
  21. Version Flags PID Region Size Lock Command Queue Response Queue Peer Address Map Inject Buffers SHM Provider SMR Shared Memory Region SMR SMR Shared memory primitives One-sided and two-sided transfers CMA (cross-memory attach) for large transfers xpmem support under consideration Single command queue SHARED MEMORY PROVIDER 21 Available
  22. HOOKING PROVIDER Hook Zero-impact unless enabled User OFI Core/Util Provider OFI Core Always available – release and debug builds Intercept calls to any provider Debugging, performance analysis, feature enhancements, testing 22
  23. PERFORMANCE MONITORING Performance Data Set Performance Management Unit CPU Cache NIC Event Data Count Sum Event Data Count Sum Event Data Count Sum Cycles Instructions Hits Misses Performance ‘domains’ ? Inline performance tracking Linux RDPMCEx: Sample CPU instructions for various code paths 23
  24. MULTI-RAIL PROVIDER User mRail EP EP 1 EP 2 EP 1 RDM OFI OFI Active EP 2 Under development Standby Application or admin configured Increase bandwidth and message rate Failover Rail selection ‘plug-in’ Require variable message support EP 1 RDM EP 2 Multiple EPs, ports, NICs, fabrics Isolate rail selection algorithm One fi_info structure per rail TBD: recovery fallback 24
  25. ACCELERATORS AND HETEROGENEOUS MEMORY CPU Memory PMEM (Smart) NICPeer Device (GPU) FPGA Device Memory Device Memory APIs assume memory mapped regions May not want to write data through CPU caches Memory regions may not be mapped Results may be cached by NIC for long transactions CPU load/stores Same coherency domain Programmable offload capabilities and flow processing May need to sync results with CPU 25 Accelerations may be available over the fabric Analyzing
  27. DOWNLOAD LIBFABRIC AND FABTESTS 27 – redirects to github Link to all releases Link to validation tests (fabtests) Man pages (scroll down)
  28. INSTALLATION 28 $ tar xvf libfabric-1.7.0.tar.bz2 $ cd libfabric-1.7.0 $ ./configure $ make $ sudo make install Follows common conventions Github only has tar files RPMs from distro or OFED • By default will auto-detect which providers are usable • Can enable/disable providers • e.g. --enable-verbs=yes • --enable-debug • Development build • --help
  29. FI_INFO 29 $ fi_info $ fi_info –l (--list) $ fi_info –v (--verbose) $ fi_info --env libfabric utility program Condensed view of usable providers List all installed providers Detailed attributes of usable providers List all environment variables Description and effect, default values FI_LOG_LEVEL=[debug | info | warn | …] FI_PROVIDER – enable/disable providers FI_PROVIDER=[tcp,udp,…] FI_PROVIDER=^[udp,…] FI_HOOK=[perf]
  31. FI_GETINFO struct fi_info *fi_allocinfo(void); int fi_getinfo( uint32_t version, const char *node, const char *service, uint64_t flags, struct fi_info *hints, struct fi_info **info); void fi_freeinfo( struct fi_info *info); 31 struct fi_info { struct fi_info *next; uint64_t caps; uint64_t mode; uint32_t addr_format; size_t src_addrlen; size_t dest_addrlen; void *src_addr; void *dest_addr; fid_t handle; struct fi_tx_attr *tx_attr; struct fi_rx_attr *rx_attr; struct fi_ep_attr *ep_attr; struct fi_domain_attr*domain_attr; struct fi_fabric_attr*fabric_attr; }; API version ~getaddrinfo app needs API semantics needed, and provider requirements for using them Detailed object attributes
  32. CAPABILITY AND MODE BITS 32 • Desired services requested by app • Primary – app must request to use • E.g. FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC • Secondary – provider can indicate availability • E.g. FI_SOURCE, FI_MULTI_RECV Capabilities • Requirements placed on the app • Improves performance when implemented by application • App indicates which modes it supports • Provider clears modes not needed • Sample: FI_CONTEXT, FI_LOCAL_MR Modes
  33. ATTRIBUTES struct fi_fabric_attr { struct fid_fabric *fabric; char *name; char *prov_name; uint32_t prov_version; uint32_t api_version; }; 33 struct fi_domain_attr { struct fid_domain *domain; char *name; enum fi_threading threading; enum fi_progress control_progress; enum fi_progress data_progress; enum fi_resource_mgmt resource_mgmt; enum fi_av_type av_type; int mr_mode; /* provider limits – fields omitted */ ... uint64_t caps; uint64_t mode; uint8_t *auth_key; ... }; Provider details Can also use env var to filter Already opened resource (if available) How resources are allocated among threads for lockless access Provider protects against queue overruns Do app threads drive transfers Secure communication (job key)
  34. ATTRIBUTES struct fi_ep_attr { enum fi_ep_type type; uint32_t protocol; uint32_t protocol_version; size_t max_msg_size; size_t msg_prefix_size; size_t max_order_raw_size; size_t max_order_war_size; size_t max_order_waw_size; uint64_t mem_tag_format; size_t tx_ctx_cnt; size_t rx_ctx_cnt; size_t auth_key_size; uint8_t *auth_key; }; 34 Indicates interoperability Order of data placement between two messages Default, shared, or scalable
  35. ATTRIBUTES struct fi_tx_attr { uint64_t caps; uint64_t mode; uint64_t op_flags; uint64_t msg_order; uint64_t comp_order; size_t inject_size; size_t size; size_t iov_limit; size_t rma_iov_limit; }; 35 struct fi_rx_attr { uint64_t caps; uint64_t mode; uint64_t op_flags; uint64_t msg_order; uint64_t comp_order; size_t total_buffered_recv; size_t size; size_t iov_limit; }; Are completions reported in order “Fast” message size Can messages be sent and received out of order
  36. THANK YOU Sean Hefty OFIWG Co-Chair • All things libfabric: • • Meetings: • • • Mailing list •
  37. LEGAL DISCLAIMER & OPTIMIZATION NOTICE 37 Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit  INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.  Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
  38. MEMORY MONITOR AND REGISTRATION CACHE Notification Queue Notification Queue Notification Queue Memory Monitor Core Monitor ‘Plug-in’ MR Map MR MR MR Registration Cache LRU List Custom Limits Usage Stats Provider Merges overlapping regions events subscribe Driver notification, hook alloc/free, provider specific Tracks active usage Internal API Get/put MRs Callbacks to add/delete MRs A generic solution is desired here 38
  39. PERSISTENT MEMORY User Commit complete RMA Write Persistent Memory User Register PMEM PMEM MR  Keep implementation agnostic • Handle offload and on-load models • Support multi-rail • Minimize state footprint High-availability model (v1.6) Evolve APIs to support other usage models Documentation limits use case New completion semantic  Exploration • Byte addressable or object aware • Single or multi-transfer commit • Advanced operations (e.g. atomics) Work with SNIA (Storage Networking Industry Association) 39