Ravi Gummaluri, Director, CXL System Architecture at Micron, describes use cases for memory expansion with tiered DRAM and CXL memory, along with performance data.
Q1 Memory Fabric Forum: Memory expansion with CXL-Ready Systems and Devices
1. Memory Expansion with CXL-Ready Systems and Devices
Presenter: Ravi Kiran Gummaluri, Micron Technology
2. Agenda
• Memory demand and scaling challenges
• CXL memory expansion
• Capacity expansion solutions
• Database performance analysis on AMD Platform
• Bandwidth expansion solutions
• AI inference performance analysis on Intel Platform
• Conclusions and Next steps
3. Memory Demand and Scaling Challenges
Growing demand for memory in data center applications (~26% YoY).
Memory latency is improving only ~1.1× every two years.
Processor speed has been doubling every two years.
DRAM is not scaling: memory capacity is doubling only every four years.
Increased TCO for data centers: memory is ~50% of the overall server cost.
How do we solve increased memory bandwidth and capacity requirements while reducing TCO?
Figure 1: Growing memory usage. Source: https://www.statista.com/statistics/871513/worldwide-data-created/
Figure 2: Memory wall
Figure 3: Memory capacity vs. CPU cores. Source: based on capacity and core counts from publicly available AMD and Intel datasheets, and public statements.
4. CXL Memory Expansion
Cache-line granular access semantics.
CXL memory appears to the system as a CPU-less NUMA node (not dependent on CPU architecture).
Hot-pluggable memory.
Works with various form factors: E1.S, E3.S, E5.S, add-in card, etc.
Interoperable with various memory types (DDR4, DDR5, LPDDR5, NVM, ..)
CXL Memory Capacity Expansion
CXL direct-attached memory tiering
1. Application transparent
OS managed
User-space library
2. Application managed
Application aware (e.g., libnuma)
Modified (e.g., libmemkind)
CXL switch / fabric-attached memory tiering
Another memory tier added to the system, with higher latencies.
CXL Memory Bandwidth Expansion
CXL heterogeneous interleave solutions
1. Hardware-based interleave
2. Software and HW heterogeneous interleave
3. Software-based NUMA interleave
Figure: Memory hierarchy
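Because a CXL expander has memory but no cores, it shows up to the OS as a NUMA node with an empty CPU list. The sketch below models that check; the topology dict is a hypothetical example (on a real Linux host the per-node CPU list lives in `/sys/devices/system/node/nodeN/cpulist`), not measured data.

```python
# Minimal sketch: a CXL memory expander appears as a CPU-less NUMA node.
# The topology below is a hypothetical two-socket system, not real data.

def cpuless_nodes(topology):
    """Return NUMA node IDs that have no local CPUs (e.g. CXL memory)."""
    return sorted(node for node, cpus in topology.items() if not cpus)

example = {
    0: list(range(0, 32)),   # socket 0: CPUs + local DRAM
    1: list(range(32, 64)),  # socket 1: CPUs + local DRAM
    2: [],                   # CXL memory node: capacity only, no CPUs
}

print(cpuless_nodes(example))  # -> [2]
```

On such a system, `numactl --hardware` lists the CXL node with no CPUs attached, and standard NUMA policies can target it like any other node.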
6. TPC-H: DRAM vs. Tiered Memory (DRAM+CXL)
CXL can provide better performance for capacity-intensive workloads.
7. HW Heterogeneous Interleave
The system address map is interleaved between local DRAM and CXL memory.
Pros
Easy to configure.
Cons
Kernel/OS cannot manage memory allocations.
⎻ Affects kernel memory.
⎻ Hides the NUMA topology from the OS.
Fixed configuration: not scalable for all workloads.
CMM capacity is restricted to align with local DRAM capacity.
Figure: HW heterogeneous interleave
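The idea behind HW interleave can be sketched as follows: the address decoder alternates cache lines between local DRAM and CXL, so the OS sees one flat range and cannot steer allocations. The 64-byte granularity and 2-way split here are illustrative assumptions, not Micron's actual decoder configuration.

```python
# Sketch of HW heterogeneous interleave: consecutive cache lines in the
# system address map alternate between local DRAM and CXL memory.
# Granularity and way count are illustrative assumptions.

CACHELINE = 64  # bytes; assumed interleave granularity

def interleave_target(addr):
    """Return which memory target serves a given physical address."""
    return "local-DRAM" if (addr // CACHELINE) % 2 == 0 else "CXL"

# Consecutive cache lines alternate between the two targets:
print([interleave_target(a) for a in range(0, 256, 64)])
# -> ['local-DRAM', 'CXL', 'local-DRAM', 'CXL']
```

This also makes the capacity restriction on the slide concrete: a fixed alternating map only works if the CXL region is sized to match the local DRAM it interleaves with.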
8. HW + SW Heterogeneous Interleave
HW: supports associating DRAM channels with different NUMA domains.
SW: interleave 4 (local) : 1 (CXL) across NUMA domains using numactl.
NPS4: each socket is partitioned into 4 NUMA domains; each NUMA domain has 3 memory channels.
Pros
NUMA topology is enabled.
Kernel/OS can manage the memory allocations.
Overcomes capacity limitations imposed by the HW interleave solution.
Cons
Fixed configuration: not scalable for all workloads.
Figure: HW + SW 4:1 interleave
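The 4:1 split falls out of plain round-robin interleave: with NPS4 the socket exposes four local NUMA domains plus one CXL node, so interleaving pages evenly across all five nodes (as `numactl --interleave` would) lands four of every five pages on local DRAM. A sketch, with illustrative node names:

```python
# Sketch of the 4(local):1(CXL) split: round-robin page interleave
# across four local NUMA domains (NPS4) plus one CXL node.
# Node names are illustrative, not a real system's topology.

from collections import Counter

def round_robin_pages(n_pages, nodes):
    """Distribute pages round-robin across NUMA nodes, like page interleave."""
    return Counter(nodes[i % len(nodes)] for i in range(n_pages))

nodes = ["local0", "local1", "local2", "local3", "cxl"]
placement = round_robin_pages(100, nodes)
local = sum(v for k, v in placement.items() if k != "cxl")
print(local, placement["cxl"])  # -> 80 20
```

The ratio is fixed by the node count, which is why the slide lists "fixed configuration" as the remaining con: changing the DRAM:CXL ratio means repartitioning the NUMA domains.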
9. SW Heterogeneous Interleave
Figure: SW interleave with weights. An application requesting 100 pages receives 80 pages from Node 0 (local DRAM, Socket 0) and 20 pages from Node 1 (CXL memory).
Memory allocations are performed according to per-node weights.
Pros
Scalable: not a fixed configuration.
o Applications can configure different weights according to BW requirements.
o This only applies when explicitly enabled for a job.
NUMA topology is enabled.
Kernel/OS can manage the memory allocations.
Overcomes capacity limitations imposed by the HW interleave solution.
Cons
CXL switch / fabric-attached memory tiers cannot take advantage of this configuration.
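The per-node weights can be sketched as a proportional allocator: with weights {DRAM: 4, CXL: 1}, a 100-page request splits 80/20 as in the figure. (Recent Linux kernels expose such weights under `/sys/kernel/mm/mempolicy/weighted_interleave/`; the model below is a simplification for illustration, not the kernel implementation.)

```python
# Sketch of SW weighted interleave: pages are distributed across NUMA
# nodes in proportion to per-node weights. Simplified model, not the
# kernel's actual weighted-interleave allocator.

from collections import Counter

def weighted_placement(n_pages, weights):
    """Distribute pages across nodes proportionally to their weights."""
    total = sum(weights.values())
    # One slot per unit of weight, cycled over the request:
    order = [node for node, w in weights.items() for _ in range(w)]
    return Counter(order[i % total] for i in range(n_pages))

print(weighted_placement(100, {"node0-DRAM": 4, "node1-CXL": 1}))
# -> Counter({'node0-DRAM': 80, 'node1-CXL': 20})
```

Unlike the fixed 4:1 NUMA-domain split, the weights here can be retuned per job to match a workload's bandwidth needs, which is the scalability advantage the slide claims.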
10. LLM Performance Optimization with Micron's CXL Memory SW Interleaving
CXL can provide better performance for bandwidth-intensive workloads.
11. Conclusion / Next Steps
Conclusions:
CXL memory expansion can provide a solution to increased memory bandwidth and capacity requirements.
CXL memory can help with bandwidth expansion using SW interleaving between DDR and CXL memory. Bandwidth-sensitive workloads, such as AI and HPC, benefit from this.
CXL memory, when introduced as tiered memory, can help increase memory capacity and reduce the latency impact of storage media. Capacity-sensitive workloads, such as database and data-analytics applications, can benefit from this.
Next Steps:
Application-aware and optimized page-allocation algorithms can further improve system performance by utilizing various memory tiers and media characteristics.
CXL memory pooling and fabric-attached memory can help further in defining various memory tiers to reduce system TCO.
12. Introducing Micron CZ120 CXL Memory Module
Delivering capacity, bandwidth, flexibility
128GB / 256GB: up to 2TB incremental server capacity supporting CXL 2.0 [1]
36GB/s memory bandwidth per module using PCIe® Gen5 x8: up to 34% increased server memory bandwidth [2]
E3.S 2T x8: industry-standard form factor for broad deployment
1. By adding 8x 256GB CZ120s; system limitations apply.
2. Memory Latency Checker bandwidth compared to a 12-channel 4800MT/s RDIMM server.