Boost Your AI Workload Performance using CXL Memory
Anil Godbole, Sr. Datacenter Prod. Planning & Mktg Mgr
Mar 2025
Intel Confidential 2
1 Source: Intel. Results may vary.
Compute Core Count Keeps Increasing
 Needed to keep up with memory-intensive workloads
 Examples
• Virtualized servers
• In-memory databases
• AI/ML
• Many others…
Value Prop of CXL-attached Memory

(1) Increased Memory Capacity
 Improve processor perf
• Faster execution
• Run more VMs/processes
 Benefitting Workloads
• Virtualized Servers
• In-memory Databases
• AI / ML
• HPC (High Perf Computing)
• Media (CDN, Video 8K)
• Medical (Genomics)

(2) Increased Memory Bandwidth
 Improve processor's memory bandwidth using address interleaving
 Benefitting Workloads
• AI/ML
• HPC
• Non-relational Databases

(3) Lower Memory TCO
 Avoid expensive 3DS DIMMs
• Use standard DIMM capacities for native & CXL
 Use lower-cost memory media on CXL
• DDR4
• (Future) Persistent memory
 Memory Pooling
• For optimal provision of local DRAM on servers

[Diagram: CPU with native DDR5 plus CXL-attached memory in EDSFF E3 or E1, or PCIe CEM/custom board form factors]
Intel Xeon Roadmap Fully Aligned with CXL Roadmap

Intel CXL Enablement Roadmap

4th & 5th Gen Intel® Xeon®: Gen4 (SPR) / Gen5 (EMR), Eagle Stream Platform
 Supports CXL v1.1 spec
 Leadership in CXL ecosystem enablement

6th Gen Intel® Xeon® CPU: Gen6 (GNR, SRF), Birch Stream Platform*
 Supports CXL v2.0 spec
 Enhanced support for CXL Memory
 Memory Pooling for PoC (Proof of Concept)

Future Gen Intel® Xeon® CPU
 Support for CXL v3.X spec

*Recommend using SKUs at HCC or above
Intel Xeon Supported CXL Memory Modes

H/W-controlled Modes
[Diagram: CPU with direct-attach DDR5 plus CXL memory in EDSFF E3/E1 or PCIe CEM add-in card]
(1) Intel Flat Memory Mode (on BHS)
 For system memory expansion
 Potential TCO savings with DDR4 reuse on CXL modules
(2) Hetero Interleave of DRAM and CXL memory address space*
 For system memory capacity & b/w expansion
 Lowers average latency

S/W-controlled Modes
(1) S/W (Hypervisor/OS/App)-assisted tiering (linear addressing)
 For system memory expansion
 S/W (O/S, middleware or application)-controlled hot/cold page movement
(2) S/W-based memory interleaving**
 For system memory capacity & b/w expansion
 S/W-controlled page interleaving

H/W-controlled tiering feature unique to Intel Xeon CPUs; completely independent of O/S version & data-tiering capabilities
* Recommended for W/Ls with a good mix of RDs/WRs; not supported on SRF CPUs. ** Requires Linux kernel v6.9 or above.
Intel Hetero-Interleave Mode
 Completely H/W-controlled mode
• Increases memory capacity & bandwidth
• DDR + CXL memory recognized as a single NUMA node
 No page movements
 No dependence on O/S-based tiering techniques
 System address space 'striped' across
• 8 / 12 native DRAM channels*
• Memory attached to 2 x16 CXL 1.1 links (~= 4x DDR5 channels; 2-way channel interleave behind each link)
 Total = 12-way / 16-way interleave; results in higher system memory bandwidth^

[Diagram: Xeon 6 CPU with 8x/12x DDR5 channels of DDR5 DIMMs (8-way/12-way* interleave) plus two x16 CXL 1.1 links to DDR5-on-buffer modules (4-way interleave overall, 2-way channel interleave per buffer)]

Intel's Hetero-Interleave mode is beneficial to b/w-hungry W/Ls like AI/ML; no dependency on O/S version/capability.

^ Recommended for W/Ls with a good mix of RDs/WRs; not supported on SRF CPUs. * 8 ch on X6500/6700 & 12 ch on X6900.
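The striping idea above can be sketched in a few lines. This is a toy model, not Intel's actual address decoder: it assumes a 256-byte interleave granule (the granularity mentioned for the H/W approach) and the 12 + 4 channel split of a Xeon 6900-class part, and simply maps consecutive address granules round-robin across DDR5 and CXL channels.

```python
GRANULE = 256  # assumed interleave granularity in bytes

def channel_for(addr: int, ddr_channels: int = 12, cxl_channels: int = 4) -> str:
    """Return which memory channel a physical address lands on
    under (ddr_channels + cxl_channels)-way round-robin striping."""
    total = ddr_channels + cxl_channels          # 16-way on a 12-channel part
    lane = (addr // GRANULE) % total
    return f"DDR5-ch{lane}" if lane < ddr_channels else f"CXL-ch{lane - ddr_channels}"

# A single 4 KB buffer already touches every channel, so its accesses
# draw on the aggregate bandwidth of all DDR5 + CXL controllers.
channels = {channel_for(a) for a in range(0, 4096, GRANULE)}
```

Because each channel serves only every 16th granule, bandwidth-hungry streaming accesses are spread across all memory controllers, which is where the aggregate-bandwidth gain comes from.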
 23% speedup with hetero mode (12ch) + CXL memory
 Hetero mode memory BW utilization
• Read/Write ratio: 2:1

[Chart: Bone Age Assessment perf speedup, throughput (fps), higher is better: native-only 12ch mode = 100%, Hetero Mode = 123%*]

[Pipeline: AI-based image analysis inference; the input image feeds a Localization Network, Regression Network and Heatmap Network, producing gender, bone age assessment and key-points heatmap outputs]

*123% is using production CXL silicon. The demo runs pre-production silicon, which shows a 112% speedup. (EMR)
S/W-Assisted B/W-Weighted Memory Interleaving
 S/W (Hypervisor/OS/App) responsible for tiering & interleaving
 System boots as two-tier memory (Near & Far)
 S/W 'stripes' pages between native & CXL memory
• Uses page-table entries to assign physical addresses to virtual address pages
 Page-striping ratio ('M:N')
• (No. of pages in native DRAM) : (No. of pages in CXL memory)
• Typically based on the ratio of native DRAM b/w to CXL memory b/w
• But completely flexible for S/W to choose
 No page movement involved
• Pages remain 'pinned' in their respective memories

Feature upstreamed in Linux (v6.9+)
https://community.intel.com/t5/Blogs/Tech-Innovation/Data-Center/Improve-your-HPC-and-AI-workload-performance-by-increasing/post/1647882
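On a kernel with this feature (v6.9+), the per-node weights are exposed through sysfs. A minimal configuration sketch, assuming node0 is the native DRAM and node1 is the CXL memory (verify with `numactl -H` on your system); `my_ai_workload` is a placeholder for your binary:

```shell
# Set M:N page-striping weights (here 5:2, as in the LLM demo later in
# this deck); derive yours from the measured b/w of each node.
echo 5 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
echo 2 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1

# Launch the workload with weighted interleaving across both nodes
# (requires a numactl recent enough to support -w / --weighted-interleave;
# older versions only offer unweighted --interleave).
numactl -w 0,1 ./my_ai_workload
```

Pages allocated under this policy stay pinned where they were placed, so there is no page-migration overhead at runtime.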
Bandwidth Expansion with DDR5 + CXL Memory

[Chart: Intel Xeon 6900 with Micron DDR5 DIMMs & CXL CZ-120 modules; interleaving weights given by (M,N) pairs]
Demo Setup: Vector Search (FAISS)

SYSTEM CONFIGURATION
Platform: Intel Avenue City
CPU family: Xeon 6 GNR-AP with 128 physical cores
Native DRAM: Micron DDR5 64 GB (6400 MT/s); 12 modules, ~768 GB
CXL Memory: Micron CZ122 128 GB x 8; 8 modules, E3.S form factor, ~1 TB
OS: Red Hat Enterprise Linux 9.4
Kernel: 6.11.6 (weighted interleaving supported)
Dataset: Microsoft Turing-ANNS (1B points, dim=100, float32)
Framework: FAISS-CPU 1.8.0 (https://faiss.ai/); Index: OPQ128_256-IVF65536_HNSW32-PQ128x4fsr
Vector Search (FAISS) Workload

https://faiss.ai

Vector search is an important workload commonly used in RAG (Retrieval-Augmented Generation) systems. It enables efficient access to relevant information and enhances the quality of generated responses, making AI interactions more accurate and contextually aware.
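The demo uses FAISS's compound OPQ/IVF/HNSW/PQ index, but the operation being accelerated is nearest-neighbor search over vectors. A dependency-free sketch of that core operation (not the FAISS API; a naive O(N·d) scan that real indexes replace with bandwidth-bound traversal of quantized structures):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query, corpus, k=1):
    """Brute-force top-k nearest vectors by cosine similarity.
    FAISS accelerates exactly this lookup, which is why its throughput
    is dominated by memory bandwidth on large corpora."""
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine_sim(query, corpus[i]),
                    reverse=True)
    return ranked[:k]

corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = search([0.9, 0.1], corpus, k=2)
```

At RAG scale (the demo's 1B x 100-dim corpus), every query streams through gigabytes of index data, which is why adding interleaved CXL bandwidth speeds up query time.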
Vector Search (FAISS): 23% Perf. Gain with DDR5 + CXL (Micron CZ122)

Memory Used    Time (ms/query)
DDR5 only      0.545
DDR5 + CXL     0.442

23% faster search time
[Image slides: Redis vector database (RAG) acceleration on Intel Xeon 6700 with Astera Labs CXL memory; Samsung demo of a popular open-source RAG database accelerated via S/W weighted interleaving, with the 3:1 ratio performing best among those tried]
LLM Inference (Llama) Demo Setup

SYSTEM CONFIGURATION
Platform: Intel Avenue City
CPU family: GNR-AP with 120 physical cores total; SNC3 mode (Sub-NUMA Clustering)
Native DRAM: Micron DDR5 128 GB, 5600 MT/s; 12 sticks, 12 x 40 GB/s = 480 GB/s (100% RDs)
CXL Memory: Micron CZ-120 128 GB; 8 modules, EDSFF form factor; total b/w 8 x 26 GB/s = 208 GB/s (100% RDs); ratio chosen 5:2
OS: Red Hat Enterprise Linux
Kernel: 6.8-rc5, with weighted S/W interleaving enabled
Model: llama-2-13b-chat-hf (quantized to int8)
Framework: Intel neural-speed framework with AMX enabled; text-GUI-based user input/output
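The 5:2 ratio above follows from the measured bandwidths (480 GB/s native vs 208 GB/s CXL, a ~2.3:1 ratio). A sketch of turning measured bandwidths into small integer page weights; `interleave_weights` is an illustrative helper, not part of any kernel or numactl API:

```python
from fractions import Fraction

def interleave_weights(dram_bw_gbs: float, cxl_bw_gbs: float, max_weight: int = 5):
    """Small integer (M, N) approximating the DRAM:CXL bandwidth ratio,
    suitable as per-node weights for weighted interleaving."""
    f = Fraction(dram_bw_gbs / cxl_bw_gbs).limit_denominator(max_weight)
    return f.numerator, f.denominator
```

For 480 and 208 GB/s this yields 7:3 (~2.33); the demo's 5:2 is a slightly coarser rounding of the same ratio. Either way the point is the same: pages are allocated to each node in proportion to the bandwidth it can serve.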
30% Performance Gain with DDR5 + CXL (Micron CZ-120)

Memory Used    Performance
DDR5 only      5.27 tokens/second
DDR5 + CXL     6.88 tokens/second

30% more tokens/sec
[Image slide: Hynix demo of the popular CacheLib in-memory caching database getting a boost with CXL memory]
Summary
 CPUs will play a big role in the AI revolution in the coming years
 There are many AI workloads, like RAG and small-LLM inferencing, where CPUs can do the job more economically
• Without needing a GPU
 Modern CPUs offer many features, like AMX accelerators & CXL interfaces, which enable efficient execution of AI workloads
 Call to Action:
 Check out Intel Xeon 6900/6700/6500 CPUs and featured IHV CXL memories for boosting your AI performance today

Editor's Notes

  • #1 I am Anil Godbole… Want to thank Memverge for giving me this opportunity to show why CPUs can still matter in the AI world today (-:
  • #2 Want to start by pointing out a simple fact: there is a lot of pressure on CPU core counts to increase, and CPU manufacturers are keeping up with that. The simple reason is that modern workloads have very big memory footprints, and these workloads cannot be satisfied by simply adding more memory; in the end the demand for memory capacity & b/w in a server arises simply because there are multiple cores simultaneously crunching the w/l. And today of course we will focus on the AI workloads.
  • #3 And to help increase the memory capacity, man invented CXL. As the graphic in the center shows, CXL allows one to augment a CPU's memory capacity beyond what it can do with its DRAM channels alone. Here we explain the value prop of CXL Type 3 Memory. The first use case is of course to add more memory to the system, or memory capacity expansion. We all know that very well. The second box shows how CXL memory can be used to increase total memory bandwidth. And the third box shows how CXL memory can be used to reduce overall memory TCO. But today we will focus on the second box. We will first explore how bandwidth is expanded using a technique called address interleaving, and later we will see how we can put this bandwidth to good AI use.
  • #4 Before we get fully underway, I want to show a quick slide to show that Intel remains fully committed to supporting CXL going ahead. With the recent launch of the Xeon 6700/6500 series of processors, primarily for enterprises, we have now completed the launch of our 6th-gen family of Xeons, previously codenamed Granite Rapids. It offers broad support for CXL, in the sense that all CPU SKUs in the family offer CXL support.
  • #5 Here we are showing the 4 possible configs which a user can deploy when using CXL memory. On the left are the H/W-controlled modes, so kernel configuration or revision does not matter. These modes are unique to Intel CPUs. And on the right are the S/W-controlled modes. These are controlled either by O/S-based features, or using middleware libs like those from Memverge, or by the application itself. Today we will focus on the b/w expansion modes, which are shown below. Of course in these modes the capacity is also increased, but a special configuration is done, like interleaved memory addressing, to improve the b/w. What can we tell about memory pooling? Memory Pooling is NOT POR on Granite Rapids. If a customer asks, then ask what their requirements are and what memory buffer devices they want to support, but make it clear that it is not POR.
  • #6 We will begin with a slide to show the concept of address interleaving. It shows the Intel H/W-based hetero-interleaving mode. The method to increase memory bandwidth is one that has been used on RAID SSDs for many years. Rather than store all the data for a given w/l in a single DRAM channel, you stripe it in chunks across all channels. So now the data can be accessed 8 times faster if the CPU has 8 memory channels, or 12 times faster like on the GNR-AP or Xeon 6900 processor. And when you add CXL to the mix, one can increase the interleave ratio even more. Assuming 1 x16 CXL channel has the equivalent b/w of 2 DDR5 channels, we can see that we can boost the interleave ratio by another 4 ways when we add 2 CXL channels to the interleaving scheme.
  • #7 Here is an example of how an AI w/l related to bone-age assessment got a perf boost. This is one of the few AI inference w/ls where there is a good mix of Rds/Wrs..
  • #8 And the other approach to increasing memory b/w is S/W-based. This slide explains the S/W memory interleaving mode which is now available through the Linux kernel beginning with v6.9. I want to acknowledge that our hosts Memverge, along with big contributions from Micron & Hynix, contributed this capability to Linux. Here the memory physical pages are interleaved between main DRAM channels & CXL channels. The graphic shows only 2 nodes, one for main memory & one for all CXL channels, but in practice one can extend that to multiple NUMA nodes when the DRAM channels are further split into sub-NUMA nodes (like for SNC3 mode). The granularity of interleaving is 4KB, in contrast to the H/W approach I showed earlier, which can go as low as 256 bytes. But the important benefit of the S/W approach is the ability to adjust the allocation ratio between the main memory NUMA node & the CXL node. The interleave ratio (M:N) is chosen based on the b/w ratio between the two memories or NUMA nodes. So when the b/w of one node is lower due to the Rd/Wr access pattern of the w/l, one can adjust (actually reduce) the page allocation from that node. We know that for Rd-intensive w/ls, like most AI inference w/ls, the b/w of CXL memory is lower compared to when we have a w/l with a mix of Rds/Wrs, so we set the M:N ratio to favor the main memory. And the opposite is true for w/ls with a Rd/Wr mix; the latter happens a lot in HPC w/ls. There can also be a TCO play here: rather than use, say, all 128GB DIMMs on the CPU memory channels to meet the w/l memory footprint, one can spread the memory between the main DRAM & CXL side using less expensive 64GB DIMMs. So this not only saves cost but also gets the b/w boost as explained above.
  • #9 Now one might wonder why the higher-latency CXL memory does not get in the way of performance when you interleave it with faster memory like DDR5. This slide explains why the net latency of memory access is actually reduced when you do this address interleaving. As the blue (DRAM-only) curve shows, when multiple cores start doing memory accesses, the memory controller becomes the bottleneck & starts queuing up transactions. That shoots up the latency to the requesting core, not because your DRAM got any slower. But with the added CXL channels the memory requests get steered or distributed across more memory controllers, thereby flattening the latency curve, as shown in orange.
  • #10 Now for the remaining part of my presentation, I am going to cite a few examples of AI w/ls which have benefitted from the memory interleaving technique. I just want to acknowledge the various contributions of our IHVs, who have partnered with us over the last few months to show off the AI w/l performance boost on our CPUs; keep in mind no GPUs were used for this. I won't have time to go through all the material but will leave it here for reference. I will start with Micron, with whom we showed the RAG database acceleration demo at SC'24.
  • #11 We used an open-source library called FAISS (I believe contributed by Meta) which has found traction with a few cloud service providers.
  • #12 We are able to boost the FAISS RAG database access time by up to 23%.
  • #13 Showing an example of a w/l done by Intel's internal perf team. This Redis vector database used for RAG was accelerated using CXL memory from Astera Labs.
  • #14 And here is a contribution from our other IHV, Samsung. They showed off how another popular open-source RAG database can be accelerated using S/W weighted memory interleaving.
  • #15 See the different interleaving ratios they tried out..And looks like 3:1 ratio was the best.
  • #16 And here is another example of Intel/Micron collaboration, where we actually ran the Llama v2 LLM on the CPU. As you know, in LLM inference the subsequent token generation is memory b/w intensive. By using the address interleaving technique we provided that b/w. And we also deployed the onboard Intel AMX accelerator to do fast matrix multiplications.
  • #17 We were able to boost the token generation by up to 30% (5.27 to 6.88 tokens/second).
  • #18 And last but not least, we have an example of in-memory caching database acceleration which was contributed by Hynix. They showed the popular CacheLib in-memory database getting a boost with CXL memory.