The Evolution of Memory Tiering at Scale
By Bob Wheeler, Principal Analyst
March 2023
©2023 Wheeler’s Network
www.wheelersnetwork.com

With first-generation chips now available, the early hype around CXL is giving way to realistic performance expectations. At the same time, software support for memory tiering is advancing, building on prior work around NUMA and persistent memory. Finally, operators have deployed RDMA to enable storage disaggregation and high-performance workloads. Thanks to these advancements, main-memory disaggregation is now within reach. Enfabrica sponsored the creation of this white paper, but the opinions and analysis are those of the author.

Tiering Addresses the Memory Crunch

Memory tiering is undergoing major advancements with the recent AMD and Intel server-processor introductions. Both AMD’s new Epyc (codenamed Genoa) and Intel’s new Xeon Scalable (codenamed Sapphire Rapids) introduce Compute Express Link (CXL), marking the beginning of new memory-interconnect architectures. The first generation of CXL-enabled processors handles Revision 1.1 of the specification, however, whereas the CXL Consortium released Revision 3.0 in August 2022.

When CXL launched, hyperbolic statements about main-memory disaggregation appeared, ignoring the realities of access and time-of-flight latencies. With first-generation CXL chips now shipping, customers are left to address the requirement that software become tier-aware. Operators or vendors must also develop orchestration software to manage pooled and shared memory. In parallel with software, the CXL-hardware ecosystem will take years to fully develop, particularly CXL 3.x components including CPUs, GPUs, switches, and memory expanders. Eventually, CXL promises to mature into a true fabric that can connect CPUs and GPUs to shared memories, but network-attached memory still has a role.

As Figure 1 shows, the memory hierarchy is becoming more granular, trading access latency against capacity and flexibility. The top of the pyramid serves the performance tier, where hot pages must be stored for maximum performance. Cold pages may be demoted to the capacity tier, which storage devices traditionally served. In recent years, however, developers have optimized software to improve performance when pages reside in different NUMA domains in multi-socket servers as well as in persistent (non-volatile) memories such as Intel’s Optane. Although Intel discontinued Optane development, its large software investment still applies to CXL-attached memories.

FIGURE 1. MEMORY HIERARCHY (Data source: University of Michigan and Meta Inc.)
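
Tier-aware placement can build directly on this NUMA groundwork, because Linux exposes CXL-attached or otherwise "far" DRAM as a CPU-less NUMA node. The sketch below is a minimal illustration using libnuma to pin a cold buffer to such a node; treating node 1 as the far-memory tier is an assumption made only for illustration, and real code should discover the node topology at runtime.

```c
/* Minimal sketch: place a cold buffer on an assumed far-memory NUMA node.
 * Assumes CXL-attached DRAM appears as CPU-less NUMA node 1 (illustrative).
 * Build with: gcc tier_alloc.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define FAR_NODE 1              /* assumption: node 1 is the capacity tier */
#define BUF_SIZE (64UL << 20)   /* 64 MiB of "cold" data */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    if (FAR_NODE > numa_max_node()) {
        fprintf(stderr, "node %d does not exist; only %d node(s) present\n",
                FAR_NODE, numa_max_node() + 1);
        return 1;
    }

    /* Allocate pages bound to the far-memory node (the capacity tier). */
    char *cold = numa_alloc_onnode(BUF_SIZE, FAR_NODE);
    if (!cold) {
        perror("numa_alloc_onnode");
        return 1;
    }

    /* Touch the pages so they are actually faulted in on FAR_NODE. */
    memset(cold, 0, BUF_SIZE);
    printf("placed %lu MiB on NUMA node %d\n", BUF_SIZE >> 20, FAR_NODE);

    numa_free(cold, BUF_SIZE);
    return 0;
}
```

Recent kernels with memory-tiering support can also demote cold pages to such nodes automatically; explicit placement as shown here is simply the most direct way for tier-aware software to exploit the capacity tier.
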
Swapping memory pages to SSD introduces a massive performance penalty, creating an opportunity for new DRAM-based capacity tiers. Sometimes referred to as “far memory,” this DRAM may reside in another server or in a memory appliance. Over the last two decades, software developers advanced the concept of network-based swap, which enables a server to access remote memory located in another server on the network. By using network interface cards that support remote DMA (RDMA), system architects can reduce the access latency to network-attached memory to less than four microseconds, as Figure 1 shows. As a result, network swap can greatly improve the performance of some workloads compared with traditional swap to storage.

Memory Expansion Drives Initial CXL Adoption

Although it’s little more than three years old, CXL has already achieved industry support exceeding that of previous coherent-interconnect standards such as CCIX, OpenCAPI, and HyperTransport. Crucially, AMD supported and implemented CXL despite Intel developing the original specification. The growing CXL ecosystem includes memory controllers (or expanders) that connect DDR4 or DDR5 DRAM to a CXL-enabled server (or host). An important factor in CXL’s early adoption is its reuse of the PCI Express physical layer, enabling I/O flexibility without adding to processor pin counts. This flexibility extends to add-in cards and modules, which use the same slots as PCIe devices. For the server designer, adding CXL support requires only the latest Epyc or Xeon processor and some attention to PCIe-lane assignments.

The CXL specification defines three device types and three protocols required for different use cases. Here, we focus on the Type 3 device used for memory expansion and the CXL.mem protocol for cache-coherent memory access. All three device types require the CXL.io protocol, but Type 3 devices use it only for configuration and control. Compared with CXL.io as well as PCIe, the CXL.mem protocol stack uses different link and transaction layers. The crucial difference is that CXL.mem (and CXL.cache) adopt fixed-length messages, whereas CXL.io uses variable-length packets like PCIe. In Revisions 1.1 and 2.0, CXL.mem uses a 68-byte flow-control unit (or flit), which handles a 64-byte cache line. CXL 3.0 adopts the 256-byte flit introduced in PCIe 6.0 to accommodate forward-error correction (FEC), but it adds a latency-optimized flit that splits error checking (CRC) into two 128-byte blocks.

Fundamentally, CXL.mem brings load/store semantics to the PCIe interface, enabling expansion of both memory bandwidth and capacity. As Figure 2 shows at left, the first CXL use cases revolve around memory expansion, starting with single-host configurations. The simplest example is a CXL memory module, such as Samsung's 512GB DDR5 memory expander with a PCIe Gen5 x8 interface in an EDSFF form factor. This module uses a CXL memory controller from Montage Technology, and the vendors claim support for CXL 2.0. Similarly, Astera Labs offers a DDR5 controller chip with a CXL 2.0 x16 interface. The company developed a PCIe add-in card combining its Leo controller chip with four RDIMM slots that handle up to a combined 2TB of DDR5 DRAM. Unloaded access latency to CXL-attached DRAM should be around 100ns greater than that of DRAM attached to a processor’s integrated memory controllers.
The memory channel appears as a single logical device (SLD), which can be allocated to only a single host. Memory expansion using a single processor and SLD represents the best case for CXL-memory performance, assuming a direct connection without intermediate devices or layers such as retimers and switches.
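
The roughly 100ns figure can be sanity-checked on real hardware. Because a direct-attached SLD appears as a memory-only NUMA node, a dependent-load (pointer-chase) loop run once against a local node and once against the CXL node exposes the unloaded-latency gap. The following sketch is illustrative only: the buffer size, step count, and the particular node number assumed to hold the CXL memory are all assumptions.

```c
/* Rough unloaded-latency probe for a NUMA node (e.g., CXL-attached DRAM).
 * Builds a random pointer chain so every load depends on the previous one.
 * Build with: gcc -O2 chase.c -lnuma    Usage: ./chase <numa-node>
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ENTRIES (32UL << 20)    /* 32M pointers (~256 MiB), larger than the LLC */
#define STEPS   (10UL << 20)    /* dependent loads to time */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <numa-node>\n", argv[0]);
        return 1;
    }
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int node = atoi(argv[1]);

    size_t *chain = numa_alloc_onnode(ENTRIES * sizeof(size_t), node);
    if (!chain) { perror("numa_alloc_onnode"); return 1; }

    /* Build a random cyclic permutation: chain[i] holds the next index. */
    size_t *perm = malloc(ENTRIES * sizeof(size_t));
    for (size_t i = 0; i < ENTRIES; i++) perm[i] = i;
    srand(1);
    for (size_t i = ENTRIES - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < ENTRIES; i++)
        chain[perm[i]] = perm[(i + 1) % ENTRIES];
    free(perm);

    /* Time STEPS dependent loads; most miss the caches and hit the node's DRAM. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (size_t i = 0; i < STEPS; i++)
        idx = chain[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("node %d: ~%.1f ns per dependent load (idx=%zu)\n",
           node, ns / STEPS, idx);

    numa_free(chain, ENTRIES * sizeof(size_t));
    return 0;
}
```

Running it against the local node and the assumed CXL node (for example, ./chase 0 versus ./chase 1) should show a gap on the order of the 100ns penalty discussed above, plus whatever retimers or switches sit in the path.
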
FIGURE 2. CXL 1.1/2.0 USE CASES

The next use case is pooled memory, which enables flexible allocation of memory regions to specific hosts. In pooling, memory is assigned and accessible to only a single host; that is, a memory region is not shared by multiple hosts simultaneously. When connecting multiple processors or servers to a memory pool, CXL enables two approaches. The original approach added a CXL switch component between the hosts and one or more expanders (Type 3 devices). The downside of this method is that the switch adds latency, which we estimate at around 80ns. Although customers can design such a system, we do not expect this use case to achieve high-volume adoption, as the added latency decreases system performance.

An alternative approach instead uses a multi-headed (MH) expander to directly connect a small number of hosts to a memory pool, as shown in the center of Figure 2. For example, startup Tanzanite Silicon Solutions demonstrated an FPGA-based prototype with four heads prior to its acquisition by Marvell, which later disclosed a forthcoming chip with eight x8 host ports. These multi-headed controllers can form the heart of a memory appliance offering a pool of DRAM to a small number of servers. The command interface for managing an MH expander wasn’t standardized until CXL 3.0, however, meaning early demonstrations used proprietary fabric management.

CXL 3.x Enables Shared-Memory Fabrics

Although it enables small-scale memory pooling, CXL 2.0 has numerous limitations. In terms of topology, it’s limited to 16 hosts and a single-level switch hierarchy. More important for connecting GPUs and other accelerators, each host supports only a single Type 2 device, which means CXL 2.0 can’t be used to build a coherent GPU server. CXL 3.0 enables up to 16 accelerators per host, allowing it to serve as a standardized coherent interconnect for GPUs. It also adds peer-to-peer (P2P) communications, multi-level switching, and fabrics with up to 4,096 nodes.

Whereas memory pooling enables flexible allocation of DRAM to servers, CXL 3.0 enables true shared memory. The shared-memory expander is called a global fabric-attached memory (G-FAM) device, and it allows multiple hosts or accelerators to coherently share memory regions. The 3.0 specification also adds up to eight dynamic capacity (DC) regions for more granular memory allocation. Figure 3 shows a simple example using a single switch to connect an arbitrary number of hosts to shared memory. In this case, either the hosts or the devices may manage cache coherence.

FIGURE 3. CXL 3.X SHARED MEMORY

For an accelerator to directly access shared memory, however, the expander must implement coherence with back invalidation (HDM-DB), which is new to the 3.0 specification. In other words, for CXL-connected GPUs to share memory, the expander must implement an inclusive snoop filter. This approach introduces potential blocking, as the specification enforces strict ordering for certain CXL.mem transactions. The shared-memory fabric will experience congestion, leading to less-predictable latency and the potential for much greater tail latency. Although the specification includes QoS Telemetry features, host-based rate throttling is optional, and these capabilities are unproven in practice.

RDMA Enables Far Memory

As CXL fabrics grow in size and heterogeneity, the performance concerns expand as well. For example, putting a switch in each shelf of a disaggregated rack is elegant, but it adds a switch hop to every transaction between different resources (compute, memory, storage, and network). Scaling to pods and beyond adds link-reach challenges, and even time-of-flight latency becomes meaningful. When multiple factors cause latency to exceed 600ns, system errors may occur. Finally, although load/store semantics are attractive for small transactions, DMA is generally more efficient for bulk-data transfers such as page swapping or VM migration. Ultimately, the coherency domain need be extended only so far.

Beyond the practical limits of CXL, Ethernet can serve the need for high-capacity disaggregated memory. From a data-center perspective, Ethernet’s reach is unlimited, and hyperscalers have scaled RDMA-over-Ethernet (RoCE) networks to thousands of server nodes. Operators have deployed these large RoCE networks for storage disaggregation using SSDs, however, not DRAM.

Figure 4 shows an example implementation of memory swap over RDMA, in this case the Infiniswap design from the University of Michigan. The researchers’ goal was to disaggregate free memory across servers, addressing memory underutilization, also known as stranding. Their approach used off-the-shelf RDMA hardware (RNICs) and avoided application modification. The system software uses an Infiniswap block device, which appears to the virtual memory manager (VMM) as conventional storage. The VMM handles the Infiniswap device as a swap partition, just as it would use a local SSD partition for page swapping.
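
Because the remote memory is exposed as an ordinary block device, no kernel changes are needed beyond the block driver itself: the device is formatted and enabled as swap just like an SSD partition. The sketch below shows the idea using the Linux swapon(2) system call; the device name /dev/infiniswap0 is purely illustrative, and in practice an administrator would typically just run mkswap and swapon from the command line.

```c
/* Minimal sketch: enable a (hypothetical) remote-memory block device as swap.
 * /dev/infiniswap0 is an illustrative name, not the project's actual device node.
 * Equivalent CLI: mkswap /dev/infiniswap0 && swapon -p 10 /dev/infiniswap0
 * Requires root (CAP_SYS_ADMIN); the device must already carry a swap signature.
 */
#include <sys/swap.h>
#include <stdio.h>

int main(void)
{
    const char *dev = "/dev/infiniswap0";   /* assumption: remote-memory block device */
    int prio = 10;                          /* prefer network swap over any disk swap */
    int flags = SWAP_FLAG_PREFER |
                ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

    if (swapon(dev, flags) != 0) {
        perror("swapon");
        return 1;
    }
    printf("paging to %s enabled at priority %d\n", dev, prio);
    return 0;
}
```

Setting a higher swap priority than any disk-backed swap area ensures the kernel pages to the remote-memory device first and falls back to disk only when it fills.
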
FIGURE 4. MEMORY SWAP OVER ETHERNET

The target server runs an Infiniswap daemon in user space, handling only the mapping of local memory to remote block devices. Once memory is mapped, read and write requests bypass the target server’s CPU using RDMA, resulting in a zero-overhead data plane. In the researchers’ system, every server loaded both software components so they could serve as both requestors and targets, but the concept extends to a memory appliance that serves only the target side.

The University of Michigan team built a 32-node cluster using 56Gbps InfiniBand RNICs, although Ethernet RNICs should operate identically. They tested several memory-intensive applications, including VoltDB running the TPC-C benchmark and Memcached running Facebook workloads. With only 50% of the working set stored in local DRAM and the remainder served by network swap, VoltDB and Memcached delivered 66% and 77%, respectively, of the performance of the same workloads with the complete working set in local DRAM. By comparison, disk-based swap with the 50% working set delivered only 4% and 6%, respectively, of baseline performance. Thus, network swap provided an order-of-magnitude speedup compared with swap to disk.

Other researchers, including teams at Alibaba and Google, advocate modifying the application to directly access a remote memory pool, leaving the operating system unmodified. This approach can deliver greater performance than the more generalized design presented by the University of Michigan. Hyperscalers have the resources to develop custom applications, whereas the broader market requires support for unmodified applications. Given the implementation complexities of network swap at scale, the application-centric approach will likely be deployed first.

Either way, Ethernet provides low latency and overhead using RDMA, and its reach easily handles row- or pod-scale fabrics. The fastest available Ethernet-NIC ports can also deliver enough bandwidth to handle one DDR5 DRAM channel. When using a jumbo frame to transfer a 4KB memory page, 400G Ethernet has only 1% overhead, yielding 49GB/s of effective bandwidth. That figure well exceeds the 31GB/s of effective bandwidth delivered by one 64-bit DDR5-4800 channel. Although 400G RNICs represent the leading edge, Nvidia shipped its ConnectX-7 adapter in volume during 2022.
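
The bandwidth comparison works out roughly as follows. The 1% framing overhead comes from the text above; the DDR5 bus-efficiency figure is an assumption chosen to reproduce the quoted ~31GB/s, given that a 64-bit DDR5-4800 channel peaks at 38.4GB/s.

```c
/* Back-of-the-envelope check of the 400GbE vs. DDR5-4800 comparison above.
 * The ~1% Ethernet overhead figure comes from the text; the 80% DDR5 bus
 * efficiency is an assumption used to reproduce the quoted ~31GB/s.
 */
#include <stdio.h>

int main(void)
{
    /* 400G Ethernet moving 4KB pages in jumbo frames */
    double enet_raw_GBps   = 400.0 / 8.0;          /* 400 Gbps -> 50 GB/s        */
    double enet_overhead   = 0.01;                 /* headers, CRC, gaps (text)  */
    double enet_eff_GBps   = enet_raw_GBps * (1.0 - enet_overhead);

    /* One 64-bit DDR5-4800 channel */
    double ddr5_peak_GBps  = 4.8 * 8.0;            /* 4800 MT/s x 8 bytes = 38.4 */
    double ddr5_efficiency = 0.80;                 /* assumed bus efficiency     */
    double ddr5_eff_GBps   = ddr5_peak_GBps * ddr5_efficiency;

    printf("400GbE effective:    %.1f GB/s\n", enet_eff_GBps);   /* ~49.5 */
    printf("DDR5-4800 effective: %.1f GB/s\n", ddr5_eff_GBps);   /* ~30.7 */
    return 0;
}
```
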
The Long Road to Memory Fabrics

Cloud data centers succeeded in disaggregating storage and network functions from CPUs, but main-memory disaggregation remained elusive. Pooled memory was on the roadmap for Intel’s Rack-Scale Architecture a decade ago but never came to fruition. The Gen-Z Consortium formed in 2016 to pursue a memory-centric fabric architecture, but system designs reached only the prototype stage. History tells us that as industry standards add complexity and optional features, their likelihood of volume adoption drops. CXL offers incremental steps along the architectural-evolution path, allowing the technology to ramp quickly while offering future iterations that promise truly composable systems.

Workloads that benefit from memory expansion include in-memory databases such as SAP HANA and Redis, in-memory caches such as Memcached, and large virtual machines, as well as AI training and inference, which must handle ever-growing large-language models. These workloads fall off a performance cliff when their working sets don’t fully fit in local DRAM. Memory pooling can alleviate the problem of stranded memory, which impacts the capital expenditures of hyperscale data-center operators. A Microsoft study, detailed in a March 2022 paper, found that up to 25% of server DRAM was stranded in highly utilized Azure clusters. The company modeled memory pooling across different numbers of CPU sockets and estimated it could reduce overall DRAM requirements by about 10%.

The case for pure-play CXL 3.x fabric adoption is less compelling, in part because of GPU-market dynamics. Current data-center GPUs from Nvidia, AMD, and Intel implement proprietary coherent interconnects for GPU-to-GPU communications, alongside PCIe for host connectivity. Nvidia’s top-end Tesla GPUs already support memory pooling over the proprietary NVLink interface, solving the stranded-memory problem for high-bandwidth memory (HBM). The market leader is likely to favor NVLink, but it may also support CXL by sharing lanes (serdes) between the two protocols. Similarly, AMD and Intel could adopt CXL in addition to Infinity Fabric and Xe-Link, respectively, in future GPUs. The absence of disclosed GPU support, however, creates uncertainty around adoption of advanced CXL 3.0 features, whereas the move to PCIe Gen6 lane rates for existing use cases is undisputed. In any case, we expect it will be 2027 before CXL 3.x shared-memory expanders achieve high-volume shipments.

In the meantime, multiple hyperscalers adopted RDMA to handle storage disaggregation as well as high-performance computing. Although the challenges of deploying RoCE at scale are widely recognized, these large customers are capable of solving the performance and reliability concerns. They can extend this deployed and understood technology into new use cases, such as network-based memory disaggregation. Research has demonstrated that a network-attached capacity tier can deliver strong performance when system architects apply it to appropriate workloads.

We view CXL and RDMA as complementary technologies, with the former delivering the greatest bandwidth and lowest latency and the latter offering greater scale. Enfabrica developed an architecture it calls an Accelerated Compute Fabric (ACF), which collapses CXL/PCIe-switch and RNIC functions into a single device. When instantiated in a multiterabit chip, the ACF can connect coherent local memory while scaling across chassis and racks using up to 800G Ethernet ports. Crucially, this approach removes dependencies on advanced CXL features that will take years to reach the market. Data-center operators will take multiple paths to memory disaggregation, as each has different priorities and unique workloads.
Those with well-defined internal workloads will likely lead, whereas others that prioritize public-cloud instances are apt to be more conservative. Early adopters create opportunities for vendors that can solve a particular customer’s most pressing need.

Bob Wheeler is an independent industry analyst who has covered semiconductors and networking for more than two decades. He is currently principal analyst at Wheeler’s Network, established in 2022. Previously, Wheeler was a principal analyst at The Linley Group and a senior editor for Microprocessor Report. Joining The Linley Group in 2001, he authored articles, reports, and white papers covering a range of chips, including Ethernet switches, DPUs, server processors, and embedded processors, as well as emerging technologies. Wheeler’s Network offers white papers, strategic consulting, roadmap reviews, and custom reports. Our free blog is available at www.wheelersnetwork.com.