Server I/O Networks: Past, Present, and Future
Renato Recio, Distinguished Engineer, Chief Architect, IBM eServer I/O
Copyright International Business Machines Corporation, 2003
Legal Notices
All statements regarding future direction and intent for IBM, the InfiniBand Trade Association, the RDMA Consortium, or any other standards organization mentioned are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM Branch Office or IBM Authorized Reseller for the full text of a specific Statement of General Direction.
IBM may have patents or pending patent applications covering subject matter in this presentation. The furnishing of this presentation does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY 10594 USA.
The information contained in this presentation has not been submitted to any formal IBM test and is distributed as is. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. The use of this information or the implementation of any techniques described herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.
The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: AIX, PowerPC, RS/6000, SP, S/390, AS/400, zSeries, iSeries, pSeries, xSeries, and Remote I/O. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Limited. Ethernet is a registered trademark of Xerox Corporation. TPC-C, TPC-D, and TPC-H are trademarks of the Transaction Processing Performance Council. InfiniBand is a trademark of the InfiniBand Trade Association. Other product or company names mentioned herein may be trademarks or registered trademarks of their respective companies or organizations.
In other words… Regarding Industry Trends and Directions
IBM respects the copyrights and trademarks of other companies…
These slides represent my views:
They do not imply IBM views or directions.
They do not imply the views or directions of the InfiniBand Trade Association, RDMA Consortium, PCI-SIG, or any other standards group.
Provides a new, very efficient I/O communication model,
that satisfies enterprise server requirements, and
can be used for I/O, cluster, and storage.
Enables middleware to communicate across a low-latency, high-bandwidth fabric, through message queues that can be accessed directly from user space.
But… it required a completely new infrastructure
(management, software, endpoint hardware, fabric switches, and links).
The I/O adapter industry viewed IB's model as too complex.
Sooo… I/O adapter vendors are staying on PCI,
IB may be used to attach high-end I/O to enterprise class servers.
Given current I/O attachment reality, enterprise class vendors will likely:
Continue extending their proprietary fabric(s), or
Tunnel PCI traffic through IB, and provide IB-PCI bridges.
I/O Expansion Network Comparison

RAS (unscheduled outage protection, scheduled outage protection, service level agreement, self-management):
  IB: interface checks, CRC; memory access controls; redundant paths; hot-plug and dynamic discovery; service levels, virtual channels.
  PCI-Express: interface checks, CRC; no native memory access controls; no redundant paths; hot-plug and dynamic discovery; traffic classes, virtual channels.
Topology:
  IB: identifier-based switched fabric.
  PCI-Express: memory-mapped switched fabric.
Connectivity:
  IB: multi-host, general.
  PCI-Express: single host, root tree.
Distance:
  IB: chip-chip, card-card connector, cable.
  PCI-Express: chip-chip, card-card connector, cable.
Link widths:
  IB: serial 1x, 4x, 12x.
  PCI-Express: serial 1x, 4x, 8x, 16x.
Link frequency:
  IB: 2.5 GHz.
  PCI-Express: 2.5 GHz.
Bandwidth range:
  IB: 250 MB/s to 3 GB/s.
  PCI-Express: 250 MB/s to 4 GB/s.
Performance (latency):
  IB: native – message-based asynchronous operations (Send and RDMA); tunnel – PIO-based synchronous operations.
  PCI-Express: PIO-based synchronous operations (network traversal for PIO Reads).
I/O Expansion Network Comparison… Continued

Virtualization:
  Host virtualization – IB: standard mechanisms available; PCI-Express: performed by host.
  Network virtualization – IB: end-point partitioning; PCI-Express: none.
  I/O virtualization – IB: standard mechanisms available; PCI-Express: no standard mechanism.
Cost:
  Infrastructure build-up – IB: new infrastructure; PCI-Express: new chip core (macro).
  Fabric consolidation potential – IB: IOEN, CAN, high-end I/O attachment; PCI-Express: IOEN and I/O attachment.
Next steps:
  Higher frequency links – IB: 5 or 6.25 GHz (work in process); PCI-Express: 5 or 6.25 GHz (work in process).
  Advanced functions – IB: verb enhancements; PCI-Express: mandatory interface checks, CRC.
An internet Offload Network Interface Controller (iONIC).
Supports one or more internet protocol suite offload services.
RDMA enabled NIC (RNIC)
An iONIC that supports the RDMA Service.
IP suite offload services include, but are not limited to:
TCP/IP Offload Engine (TOE) Service
Remote Direct Memory Access (RDMA) Service
iSCSI Extensions for RDMA (iSER) Service
Figure: iONIC software stack – a host sockets application can run over the Ethernet Link Service (via the NIC driver and host TCP/IP), over the TOE Service (via the TOE driver and TOE Service library), or over the RDMA Service (via the RNIC driver and RDMA Service library, using RDMA/DDP/MPA over TCP/IP/Ethernet in the iONIC). Only the Ethernet Link, TOE, and RDMA Services are shown.
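To make the service layering concrete, here is a minimal sketch (not from the original slides) of an ordinary TCP sockets client in C. The point is that this application code is unchanged whether the TCP/IP processing runs in the host stack (Ethernet Link Service) or is offloaded to the iONIC's TOE or sockets-over-RDMA service; the host name and port below are placeholders.

```c
/* Minimal TCP sockets client; illustrative only.
 * With a TOE or sockets-over-RDMA service, this code is unchanged --
 * the protocol processing simply moves from the host stack to the iONIC. */
#include <stdio.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints = {0}, *res;
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    /* "server.example.com" and port 7 (echo) are placeholders. */
    if (getaddrinfo("server.example.com", "7", &hints, &res) != 0)
        return 1;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    const char msg[] = "hello";
    char reply[64];

    send(fd, msg, sizeof(msg), 0);                  /* data enters the send path */
    ssize_t n = recv(fd, reply, sizeof(reply), 0);  /* blocks for the reply      */
    if (n > 0)
        printf("received %zd bytes\n", n);

    close(fd);
    freeaddrinfo(res);
    return 0;
}
```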
Network Stack Offload – iONIC RDMA Service Overview
Verb consumer – Software that uses the RDMA Service to communicate with other nodes.
Communication is through verbs that:
Manage connection state.
Manage memory and queue access.
Submit work to the iONIC.
Retrieve completed work and events from the iONIC.
RDMA Service Interface (RI) performs work on behalf of the consumer.
RI consists of:
Driver – Performs privileged functions.
Library – Performs user space functions.
RNIC – Hardware adapter.
Figure: iONIC RDMA Service structure – the verb consumer issues verbs to the RDMA Service Interface (RI), implemented by the RNIC driver/library and the RNIC data engine layer; the data engine maintains QP context (QPC), send, receive, and shared receive queues (SQ, RQ, SRQ), completion queues (CQ), asynchronous events (AE), and the memory Translation and Protection Table (TPT), running over RDMA/DDP/MPA/TCP/IP.
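As an illustration of the verb flow above, here is a hedged sketch in C using the modern OpenFabrics libibverbs API as a stand-in for the RNIC verbs described on this slide (the RNIC verb names and signatures differ; this is not the interface from the presentation). It assumes a queue pair that has already been created and connected (setup and connection management are omitted) and shows memory registration, posting a Send work request to the SQ, and polling the CQ for the completion.

```c
/* Hedged sketch of the verb-consumer flow, using libibverbs as a stand-in
 * for RNIC verbs: register memory, post a Send WR to the send queue (SQ),
 * and poll the completion queue (CQ). Assumes pd/cq/qp already exist and
 * the QP is connected. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int send_and_wait(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                  void *buf, size_t len)
{
    /* Memory management verb: make the buffer visible to the adapter
     * (conceptually, an entry in the translation and protection table). */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    /* Work submission verb: describe the buffer and post a Send to the SQ. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,   /* request a CQ entry */
    };
    struct ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(qp, &wr, &bad_wr)) {
        ibv_dereg_mr(mr);
        return -1;
    }

    /* Work retrieval verb: poll the CQ until the Send completes. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;   /* busy-poll for brevity; real consumers also use event channels */

    ibv_dereg_mr(mr);
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```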
Figure: logical tiers – presentation server, DB client & replication, web application server, business function server (OLTP & BI DB; HPC).
NC not useful at present due to XML & Java overheads.
Sockets-level NC support beneficial
(5 to 6% performance gain for communication between App tier and business function tier)
(0 to 90% performance gain for communication between browser and web server)
Low-level (uDAPL, ICSC) support most beneficial
(4 to 50% performance gain for business function tier)
iSCSI, DAFS support beneficial
(5 to 50% gain for NFS/RDMA compared to NFS performance)
Legend note: all tiers are logical; they can potentially run on the same server OS instance(s); inter-tier links traditionally use the cluster network. (Diagram labels: client tier – browser/user; web server – presentation data; application data; business data.)
Use of proprietary cluster networks for high-end clusters will continue to decline.
Multi-platform cluster networks have already begun to gain significant share.
Standards-based cluster networks will become the dominant form.
Chart: cluster interconnect technology in the Top 500 supercomputers (Top 100, Next 100, Last 100) for June 2000, June 2002, and November 2002, showing the share (0–100%) of single-platform, multi-platform, and standards-based interconnects. Source: Top 500 study by Tom Heller.
Chart: reduction in LAN process-to-process latencies for 256 B and 8 KB blocks (absolute and normalized) – 1 GigE (100 MFLOP = 19 us), 10 GigE (100 MFLOP = 6 us), and IB (100 MFLOP = 6 us); plotted reductions are 1.2x, 4.6x, 8.4x and 2.5x, 3.0x, 3.9x lower.
Parallel SCSI and FC have a very efficient path through the O/S.
Existing driver to hardware interface has been tuned for many years.
An efficient driver-HW interface model has been a key iSCSI adoption issue.
Next steps in iSCSI development:
Offload TCP/IP processing to the host bus adapter,
Provide switches that satisfy SAN latencies requirements,
Reduce read and write processing overhead at the initiator and target.
Figure: storage I/O software paths – parallel SCSI or FC (application, FS API, FS/LVM, storage driver, storage adapter); iSCSI Service in the host (application, FS API, FS/LVM, iSCSI, TCP/IP, NIC driver, partial-offload Ethernet NIC); iSCSI Service in an iONIC (application, FS API, FS/LVM, storage driver, adapter driver, iSCSI HBA). Chart: CPU instructions per byte vs. transfer size in bytes (1 to 100,000) for parallel SCSI, iSCSI Service in host, and iSCSI Service in iONIC.
RDMA will significantly improve NAS server performance.
Host network stack processing will be offloaded to an iONIC.
Removes TCP/IP processing from the host path.
Allows zero copy.
NAS (NFS with RDMA Extensions) protocols will exploit RDMA.
RDMA allows a file-level access device to approach
block-level access device performance levels,
creating a performance discontinuity for storage.
Figure: NFS software paths – NFS over the Ethernet Link Service NIC (application, NFS API, NFS, TCP/IP, NIC driver, partial-offload Ethernet NIC) versus NFS Extensions for RDMA over the RDMA Service in an iONIC (application, NFS API, NFS, RDMA/DDP, NIC driver, RNIC running MPA/TCP over IP/Ethernet). Chart: CPU instructions per byte vs. transfer size in bytes (1 to 100,000) for NFS over ELS NIC, NFS over RNIC, and parallel SCSI.
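To show what zero copy means in practice, here is another hedged sketch (again using libibverbs as a stand-in for RNIC verbs; none of this code is from the slides or from the NFS/RDMA specification): a server-side RDMA Write that places reply data directly into a buffer the client registered and advertised earlier, so the client's host stack never touches or copies the payload. The remote address and rkey are assumed to have been exchanged at the upper-layer protocol level.

```c
/* Hedged sketch: server pushes reply data straight into a client buffer
 * with an RDMA Write, so no client-side data copy is needed.
 * Assumes an already-connected QP and a local MR covering buf;
 * remote_addr and rkey were advertised by the client beforehand. */
#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_write_reply(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *buf, uint32_t len,
                     uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id              = 2,
        .sg_list            = &sge,
        .num_sge            = 1,
        .opcode             = IBV_WR_RDMA_WRITE,  /* data lands in client memory */
        .send_flags         = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,       /* client's registered buffer  */
        .wr.rdma.rkey        = rkey,              /* client's steering tag       */
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);       /* completion polled elsewhere */
}
```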
Paced by higher frequency circuits, higher performance microprocessors, and larger fast-write and read cache memory.
SANs will gradually transition from FC to IP/Ethernet networks.
Motivated by TCO/complexity reduction.
Paced by availability of:
iSCSI with efficient TOE (possibly RNIC)
Lower latency switches
NAS will be more competitive against SAN.
Paced by RNIC availability.
Charts: single adapter/controller throughput over time – GB/s (1990–2005) and thousands of IOPS (1994–2008) for SCSI, FC, disk head, and iSCSI/Ethernet. Source: product literature from 14 companies; the workloads typically used are 100% reads of 512-byte data, which is not a good measure of overall sustained performance but is a good measure of adapter/controller front-end throughput capability.