Boost Your AI Workload Performance using CXL Memory
Anil Godbole, Sr. Datacenter Prod. Planning & Mktg Mgr
Mar 2025
Intel Confidential 2
1 Source: Intel. Results may vary.
Compute Core Count Keeps Increasing
 Needed to keep up with memory-intensive workloads
 Examples
• Virtualized servers
• In-memory databases
• AI/ML
• Many others…
Value Prop of CXL-attached Memory

(1) Increased Memory Capacity
 Improve processor perf
• Faster execution
• Run more VMs/processes
 Benefitting Workloads
• Virtualized Servers
• In-memory Databases
• AI / ML
• HPC (High Perf Computing)
• Media (CDN, Video 8K)
• Medical (Genomics)

(2) Increased Memory Bandwidth
 Improve processor's memory bandwidth using address interleaving
 Benefitting Workloads
• AI/ML
• HPC
• Non-relational Databases

(3) Lower Memory TCO
 Avoid expensive 3DS DIMMs
• Use standard DIMM capacities for native & CXL
 Use lower-cost memory media on CXL
• DDR4
• (Future) Persistent memory
 Memory Pooling
• For optimal provision of local DRAM on servers

[Diagram: CPU with native DDR5 plus CXL-attached memory in EDSFF E3 or E1, or PCIe CEM/custom board form factors]
Intel Xeon Roadmap Fully Aligned with CXL Roadmap

Intel CXL Enablement Roadmap

4th & 5th Gen Intel® Xeon®: Gen4 (SPR) / Gen5 (EMR), Eagle Stream Platform
 Supports CXL v1.1 spec
 Leadership in CXL ecosystem enablement

6th Gen Intel® Xeon® CPU: Gen6 (GNR, SRF), Birch Stream Platform*
 Supports CXL v2.0 spec
 Enhanced support for CXL Memory
 Memory Pooling for PoC (Proof of Concept)

Future Gen Intel® Xeon® CPU
 Support for CXL v3.X spec

*Recommend using SKUs at HCC or above
Intel Xeon Supported CXL Memory Modes

H/W-controlled Modes
[Diagram: CPU with direct-attach DDR5 plus CXL memory in EDSFF E3/E1 or PCIe CEM add-in card]
(1) Intel Flat Memory Mode (on BHS)
 For system memory expansion
 Potential TCO savings with DDR4 reuse on CXL modules
(2) Hetero Interleave of DRAM and CXL memory address space*
 For system memory capacity & b/w expansion
 Lowers average latency

S/W-controlled Modes
(1) S/W (Hypervisor/OS/App)-assisted tiering (linear addressing)
 For system memory expansion
 S/W (O/S, middleware or application)-controlled hot/cold page movement
(2) S/W-based memory interleaving**
 For system memory capacity & b/w expansion
 S/W-controlled page interleaving

H/W-controlled tiering feature unique to Intel Xeon CPUs; completely independent of O/S version & data-tiering capabilities
* Recommended for W/Ls with a good mix of RDs/WRs; not supported on SRF CPUs. ** Requires Linux kernel v6.9 or above.
Intel Hetero-Interleave Mode
 Completely H/W-controlled mode
• Increases memory capacity & bandwidth
• DDR + CXL memory recognized as a single NUMA node
 No page movements
 No dependence on O/S-based tiering techniques
 System address space 'striped' across
• 8 / 12 native DRAM channels*
• Memory attached to 2 x16 CXL 1.1 links (~= 4x DDR5 channels; 2-way channel interleave behind each link)
 Total = 12-way / 16-way interleave; results in higher system memory bandwidth^

[Diagram: Xeon 6 CPU with 8x/12x DDR5 channels of DDR5 DIMMs (8-way/12-way* interleave) plus two x16 CXL 1.1 links to DDR5-on-buffer modules (4-way interleave overall, 2-way channel interleave per buffer)]

Intel's Hetero-Interleave mode is beneficial to b/w-hungry W/Ls like AI/ML; no dependency on O/S version/capability.

^ Recommended for W/Ls with a good mix of RDs/WRs; not supported on SRF CPUs. * 8 ch on X6500/6700 & 12 ch on X6900.
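The striping idea above can be sketched in a few lines. This is a toy model, not Intel's actual address decoder: it assumes a 256-byte interleave granule (the granularity mentioned for the H/W approach) and the 12 + 4 channel split of a Xeon 6900-class part, and simply maps consecutive address granules round-robin across DDR5 and CXL channels.

```python
GRANULE = 256  # assumed interleave granularity in bytes

def channel_for(addr: int, ddr_channels: int = 12, cxl_channels: int = 4) -> str:
    """Return which memory channel a physical address lands on
    under (ddr_channels + cxl_channels)-way round-robin striping."""
    total = ddr_channels + cxl_channels          # 16-way on a 12-channel part
    lane = (addr // GRANULE) % total
    return f"DDR5-ch{lane}" if lane < ddr_channels else f"CXL-ch{lane - ddr_channels}"

# A single 4 KB buffer already touches every channel, so its accesses
# draw on the aggregate bandwidth of all DDR5 + CXL controllers.
channels = {channel_for(a) for a in range(0, 4096, GRANULE)}
```

Because each channel serves only every 16th granule, bandwidth-hungry streaming accesses are spread across all memory controllers, which is where the aggregate-bandwidth gain comes from.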
 23% speedup with hetero mode (12ch) + CXL memory
 Hetero mode memory BW utilization
• Read/Write ratio: 2:1

[Chart: Bone Age Assessment perf speedup, throughput (fps), higher is better: native-only 12ch mode = 100%, Hetero Mode = 123%*]

[Pipeline: AI-based image analysis inference; the input image feeds a Localization Network, Regression Network and Heatmap Network, producing gender, bone age assessment and key-points heatmap outputs]

*123% is using production CXL silicon. The demo runs pre-production silicon, which shows a 112% speedup. (EMR)
S/W-Assisted B/W-Weighted Memory Interleaving
 S/W (Hypervisor/OS/App) responsible for tiering & interleaving
 System boots as two-tier memory (Near & Far)
 S/W 'stripes' pages between native & CXL memory
• Uses page-table entries to assign physical addresses to virtual address pages
 Page-striping ratio ('M:N')
• (No. of pages in native DRAM) : (No. of pages in CXL memory)
• Typically based on the ratio of native DRAM b/w to CXL memory b/w
• But completely flexible for S/W to choose
 No page movement involved
• Pages remain 'pinned' in their respective memories

Feature upstreamed in Linux (v6.9+)
https://community.intel.com/t5/Blogs/Tech-Innovation/Data-Center/Improve-your-HPC-and-AI-workload-performance-by-increasing/post/1647882
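On a kernel with this feature (v6.9+), the per-node weights are exposed through sysfs. A minimal configuration sketch, assuming node0 is the native DRAM and node1 is the CXL memory (verify with `numactl -H` on your system); `my_ai_workload` is a placeholder for your binary:

```shell
# Set M:N page-striping weights (here 5:2, as in the LLM demo later in
# this deck); derive yours from the measured b/w of each node.
echo 5 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
echo 2 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1

# Launch the workload with weighted interleaving across both nodes
# (requires a numactl recent enough to support -w / --weighted-interleave;
# older versions only offer unweighted --interleave).
numactl -w 0,1 ./my_ai_workload
```

Pages allocated under this policy stay pinned where they were placed, so there is no page-migration overhead at runtime.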
Bandwidth Expansion with DDR5 + CXL Memory

[Chart: Intel Xeon 6900 with Micron DDR5 DIMMs & CXL CZ-120 modules; interleaving weights given by (M,N) pairs]
Demo Setup: Vector Search (FAISS)

SYSTEM CONFIGURATION
Platform: Intel Avenue City
CPU family: Xeon 6 GNR-AP with 128 physical cores
Native DRAM: Micron DDR5 64 GB (6400 MT/s); 12 modules, ~768 GB
CXL Memory: Micron CZ122 128 GB x 8; 8 modules, E3.S form factor, ~1 TB
OS: Red Hat Enterprise Linux 9.4
Kernel: 6.11.6 (weighted interleaving supported)
Dataset: Microsoft Turing-ANNS (1B points, dim=100, float32)
Framework: FAISS-CPU 1.8.0 (https://faiss.ai/); Index: OPQ128_256-IVF65536_HNSW32-PQ128x4fsr
Vector Search (FAISS) Workload

https://faiss.ai

Vector search is an important workload commonly used in RAG (Retrieval-Augmented Generation) systems. It enables efficient access to relevant information and enhances the quality of generated responses, making AI interactions more accurate and contextually aware.
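The demo uses FAISS's compound OPQ/IVF/HNSW/PQ index, but the operation being accelerated is nearest-neighbor search over vectors. A dependency-free sketch of that core operation (not the FAISS API; a naive O(N·d) scan that real indexes replace with bandwidth-bound traversal of quantized structures):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query, corpus, k=1):
    """Brute-force top-k nearest vectors by cosine similarity.
    FAISS accelerates exactly this lookup, which is why its throughput
    is dominated by memory bandwidth on large corpora."""
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine_sim(query, corpus[i]),
                    reverse=True)
    return ranked[:k]

corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = search([0.9, 0.1], corpus, k=2)
```

At RAG scale (the demo's 1B x 100-dim corpus), every query streams through gigabytes of index data, which is why adding interleaved CXL bandwidth speeds up query time.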
Vector Search (FAISS): 23% Perf. Gain with DDR5 + CXL (Micron CZ122)

Memory Used    Time (ms/query)
DDR5 only      0.545
DDR5 + CXL     0.442

23% faster search time
[Image slides: Redis vector database (RAG) acceleration on Intel Xeon 6700 with Astera Labs CXL memory; Samsung demo of a popular open-source RAG database accelerated via S/W weighted interleaving, with the 3:1 ratio performing best among those tried]
LLM Inference (Llama) Demo Setup

SYSTEM CONFIGURATION
Platform: Intel Avenue City
CPU family: GNR-AP with 120 physical cores total; SNC3 mode (Sub-NUMA Clustering)
Native DRAM: Micron DDR5 128 GB, 5600 MT/s; 12 sticks, 12 x 40 GB/s = 480 GB/s (100% RDs)
CXL Memory: Micron CZ-120 128 GB; 8 modules, EDSFF form factor; total b/w 8 x 26 GB/s = 208 GB/s (100% RDs); ratio chosen 5:2
OS: Red Hat Enterprise Linux
Kernel: 6.8-rc5, with weighted S/W interleaving enabled
Model: llama-2-13b-chat-hf (quantized to int8)
Framework: Intel neural-speed framework with AMX enabled; text-GUI-based user input/output
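The 5:2 ratio above follows from the measured bandwidths (480 GB/s native vs 208 GB/s CXL, a ~2.3:1 ratio). A sketch of turning measured bandwidths into small integer page weights; `interleave_weights` is an illustrative helper, not part of any kernel or numactl API:

```python
from fractions import Fraction

def interleave_weights(dram_bw_gbs: float, cxl_bw_gbs: float, max_weight: int = 5):
    """Small integer (M, N) approximating the DRAM:CXL bandwidth ratio,
    suitable as per-node weights for weighted interleaving."""
    f = Fraction(dram_bw_gbs / cxl_bw_gbs).limit_denominator(max_weight)
    return f.numerator, f.denominator
```

For 480 and 208 GB/s this yields 7:3 (~2.33); the demo's 5:2 is a slightly coarser rounding of the same ratio. Either way the point is the same: pages are allocated to each node in proportion to the bandwidth it can serve.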
30% Performance Gain with DDR5 + CXL (Micron CZ-120)

Memory Used    Performance
DDR5 only      5.27 tokens/second
DDR5 + CXL     6.88 tokens/second

30% more tokens/sec
[Image slide: Hynix demo of the popular CacheLib in-memory caching database getting a boost with CXL memory]
Summary
 CPUs will play a big role in the AI revolution in the coming years
 There are many AI workloads, like RAG and small-LLM inferencing, where CPUs can do the job more economically
• Without needing a GPU
 Modern CPUs offer many features, like AMX accelerators & CXL interfaces, which enable efficient execution of AI workloads
 Call to Action:
 Check out Intel Xeon 6900/6700/6500 CPUs and featured IHV CXL memories for boosting your AI performance today

Editor's Notes

  • #1 I am Anil Godbole… Want to thank Memverge for giving me this opportunity to show why CPUs can still matter in the AI world today (-:
  • #2 Want to start by pointing out a simple fact: there is a lot of pressure on CPU core counts to increase, and CPU manufacturers are keeping up with that. The simple reason is that modern workloads have very big memory footprints, and these workloads cannot be satisfied by simply adding more memory; in the end the demand for memory capacity & b/w in a server arises simply because there are multiple cores simultaneously crunching the w/l. And today of course we will focus on the AI workloads.
  • #3 And to help increase the memory capacity, man invented CXL. As the graphic in the center shows, CXL allows one to augment a CPU's memory capacity beyond what it can do with its DRAM channels alone. Here we explain the value prop of CXL Type 3 Memory. The first use case is of course to add more memory to the system, or memory capacity expansion. We all know that very well. The second box shows how CXL memory can be used to increase total memory bandwidth. And the third box shows how CXL memory can be used to reduce overall memory TCO. But today we will focus on the second box. We will first explore how bandwidth is expanded using a technique called address interleaving, and later we will see how we can put this bandwidth to good AI use.
  • #4 Before we get fully underway, I want to show a quick slide to show that Intel remains fully committed to supporting CXL going ahead. With the recent launch of the Xeon 6700/6500 series of processors, primarily for enterprises, we have now completed the launch of our 6th-gen family of Xeons, previously codenamed Granite Rapids. It offers broad support for CXL, in the sense that all CPU SKUs in the family offer CXL support.
  • #5 Here we are showing the 4 possible configs which a user can deploy when using CXL memory. On the left are the H/W-controlled modes, so kernel configuration or revision does not matter. These modes are unique to Intel CPUs. And on the right are the S/W-controlled modes. These are controlled either by O/S-based features, or using middleware libs like those from Memverge, or by the application itself. Today we will focus on the b/w expansion modes, which are shown below. Of course in these modes the capacity is also increased, but a special configuration is done, like interleaved memory addressing, to improve the b/w. What can we tell about memory pooling? Memory Pooling is NOT POR on Granite Rapids. If a customer asks, then ask what their requirements are and what memory buffer devices they want to support, but make it clear that it is not POR.
  • #6 We will begin with a slide to show the concept of address interleaving. It shows the Intel H/W-based hetero-interleaving mode. The method to increase memory bandwidth is one that has been used on RAID SSDs for many years. Rather than store all the data for a given w/l in a single DRAM channel, you stripe it in chunks across all channels. So now the data can be accessed 8 times faster if the CPU has 8 memory channels, or 12 times faster like on the GNR-AP or Xeon 6900 processor. And when you add CXL to the mix, one can increase the interleave ratio even more. Assuming 1 x16 CXL channel has the equivalent b/w of 2 DDR5 channels, we can see that we can boost the interleave ratio by another 4 ways when we add 2 CXL channels to the interleaving scheme.
  • #7 Here is an example of how an AI w/l related to bone-age assessment got a perf boost. This is one of the few AI inference w/ls where there is a good mix of Rds/Wrs..
  • #8 And the other approach to increasing memory b/w is S/W-based. This slide explains the S/W memory interleaving mode which is now available through the Linux kernel beginning with v6.9. I want to acknowledge that our hosts Memverge, along with big contributions from Micron & Hynix, contributed this capability to Linux. Here the memory physical pages are interleaved between main DRAM channels & CXL channels. The graphic shows only 2 nodes, one for main memory & one for all CXL channels, but in practice one can extend that to multiple NUMA nodes when the DRAM channels are further split into sub-NUMA nodes (like for SNC3 mode). The granularity of interleaving is 4KB, in contrast to the H/W approach I showed earlier, which can go as low as 256 bytes. But the important benefit of the S/W approach is the ability to adjust the allocation ratio between the main memory NUMA node & the CXL node. The interleave ratio (M:N) is chosen based on the b/w ratio between the two memories or NUMA nodes. So when the b/w of one node is lower due to the Rd/Wr access pattern of the w/l, one can adjust (actually reduce) the page allocation from that node. We know that for Rd-intensive w/ls, like most AI inference w/ls, the b/w of CXL memory is lower compared to when we have a w/l with a mix of Rds/Wrs, so we set the M:N ratio to favor the main memory. And the opposite is true for w/ls with a Rd/Wr mix; the latter happens a lot in HPC w/ls. There can also be a TCO play here: rather than use, say, all 128GB DIMMs on the CPU memory channels to meet the w/l memory footprint, one can spread the memory between the main DRAM & CXL side using less expensive 64GB DIMMs. So this not only saves cost but also gets the b/w boost as explained above.
  • #9 Now one might wonder why the higher-latency CXL memory does not get in the way of performance when you interleave it with faster memory like DDR5. This slide explains why the net latency of memory access is actually reduced when you do this address interleaving. As the blue (DRAM-only) curve shows, when multiple cores start doing memory accesses, the memory controller becomes the bottleneck & starts queuing up transactions. That shoots up the latency to the requesting core, not because your DRAM got any slower. But with the added CXL channels the memory requests get steered or distributed across more memory controllers, thereby flattening the latency curve, as shown in orange.
  • #10 Now for the remaining part of my presentation, I am going to cite a few examples of AI w/ls which have benefitted from the memory interleaving technique. I just want to acknowledge the various contributions of our IHVs, who have partnered with us over the last few months to show off the AI w/l performance boost on our CPUs; keep in mind no GPUs were used for this. I won't have time to go through all the material but will leave it here for reference. I will start with Micron, with whom we showed the RAG database acceleration demo at SC'24.
  • #11 We used an open-source library called FAISS (I believe contributed by Meta) which has found traction with a few cloud service providers.
  • #12 We are able to boost the FAISS RAG database access time by up to 23%.
  • #13 Showing an example of a w/l done by Intel's internal perf team. This Redis vector database used for RAG was accelerated using CXL memory from Astera Labs.
  • #14 And here is a contribution from our other IHV, Samsung. They showed off how another popular open-source RAG database can be accelerated using S/W weighted memory interleaving.
  • #15 See the different interleaving ratios they tried out..And looks like 3:1 ratio was the best.
  • #16 And here is another example of Intel/Micron collaboration, where we actually ran the Llama v2 LLM on the CPU. As you know, in LLM inference the subsequent token generation is memory b/w intensive. By using the address interleaving technique we provided that b/w. And we also deployed the onboard Intel AMX accelerator to do fast matrix multiplications.
  • #17 We were able to boost the token generation by up to 30% (5.27 to 6.88 tokens/second).
  • #18 And last but not least, we have an example of in-memory caching database acceleration which was contributed by Hynix. They showed the popular CacheLib in-memory database getting a boost with CXL memory.