Intel Compute Express Link™
Enablement
Anil Godbole, Intel
Feb 2024
2
1 Source: Intel. Results may vary.
2 Source: https://flashmemorysummit.com/English/Collaterals/Proceedings/2018/20180809_NEWM-301A-1_Gervasi.pdf
3 Source: Intel Internal. Estimates – based on large scale deployments and does not include software costs.
[Chart: Current cost percentage of server memory compared to other components]

[Chart: DRAM Density Over Time² (1 MB in 1985 to 32 GB projected by 2025): scaling of DRAM density is slowing, from ~4x every 3 years, to ~2x every 3 years, to a projected ~2x every 4 years]

[Chart: CPU Core Growth Projection Over Time¹ (2017-2023): compute performance growth is accelerating, with exponential CPU core growth]
Memory density and costs are not keeping pace with data center workload and infrastructure cost requirements
Popular memory-intensive workloads:
▪ AI/ML with LLMs
▪ Databases & analytics
▪ Web-caching apps
▪ Content delivery networks
▪ Virtual desktop infrastructure

Today, memory costs dominate the server's BOM.
3
CXL Allows Addition of Memory to a Server
Copyright CXL Consortium 2021
4
CXL on Motherboards: Same slot for PCIe OR CXL
▪ Starting with 4th Gen Intel® Xeon® (Sapphire Rapids, SPR) processors
• Flexible port configured for PCIe or CXL during link-up
[Diagram: Intel® Xeon® host with a mux feeding a PCIe x16 connector/slot at 32 GT/s; during link-up the same slot trains as either a x16 CXL device or a x16 PCIe device. Photo: PCIe slots on an Intel Archer City board]
5
Augment System Memory with CXL
CPU-attached DRAM (native DDR5):
▪ Expensive to add more DRAM channels to the CPU package
▪ Memory capacity expansion with 2 DIMMs per channel (2 DPC) often causes a drop in total memory bandwidth (DDR5-5600 → DDR5-4800)
▪ Has the lowest memory latency

CXL-attached memory (EDSFF E3 or E1, or PCIe CEM/custom board):
▪ Cheaper to add CXL channels to the CPU package
• 66 pins for one x16 CXL link vs. 250 pins for two DDR5 channels
• Note: bandwidth of one x16 CXL link ≈ two DDR5 channels (see the back-of-the-envelope check below)
▪ Allows bandwidth expansion irrespective of the DRAM configuration on the CXL memory buffer
▪ Reduces TCO by re-using older DDR4 memory or by using cheaper, lower-bandwidth memory such as NVM
▪ Has higher latency than CPU-attached DRAM
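As a rough sanity check on the note above that one x16 CXL link is roughly equal to two DDR5 channels in bandwidth, here is a back-of-the-envelope comparison. The DDR5-5600 speed bin is taken from the bullet above, and protocol/encoding overheads on both interfaces are ignored, so treat these as ballpark figures only.

```latex
% Back-of-the-envelope comparison (assumed speed bins; protocol overheads ignored)
\begin{align*}
\text{x16 CXL link @ 32 GT/s:}\quad & 16 \times 32\,\text{Gb/s} \approx 64\,\text{GB/s per direction}
  \approx 128\,\text{GB/s combined (read + write)}\\
\text{One DDR5-5600 channel:}\quad & 5600\,\text{MT/s} \times 8\,\text{B} \approx 44.8\,\text{GB/s}\\
\text{Two DDR5-5600 channels:}\quad & 2 \times 44.8\,\text{GB/s} \approx 90\,\text{GB/s}
\end{align*}
```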
6
CXL Memory Tiering

CXL-based memory addition → memory tiers (NUMA nodes): native DRAM ("near memory") / CXL memory ("far memory")

Software-assisted (hypervisor/OS/app) memory tiering mechanism:
▪ Software performs hot/cold page movement (a sketch follows this slide)
▪ Larger-granularity (4 KB+) transfers
▪ Tracking/telemetry overheads

Hardware-controlled memory tiering, two options:
(1) Interleave the DRAM and CXL memory address space
▪ System memory & bandwidth expansion
▪ Lowers average latency
(2) Intel Flat Memory Mode (on BHS)
▪ System memory expansion
▪ TCO reduction

Intel's HW-controlled tiering features are unique to Intel Xeon CPUs; the system boots as a single NUMA node and provides OS-version-agnostic performance gains.
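To make the software-assisted path above concrete, here is a minimal sketch using libnuma, assuming the CXL expander shows up as a CPU-less NUMA node. The node IDs (0 for near DRAM, 1 for CXL far memory) and the "hot page" choice are illustrative assumptions, not from the slides; this is the OS/application-driven movement that the hardware-controlled modes avoid.

```c
/*
 * Hedged sketch of software-assisted tiering: treat the CXL expander as a
 * CPU-less "far" NUMA node and promote a hot page to the "near" DRAM node.
 * Node IDs are assumptions for illustration; query them with
 * numactl --hardware on a real system. Build with: gcc demo.c -lnuma
 */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NEAR_NODE 0   /* assumed: CPU-attached DDR5 */
#define FAR_NODE  1   /* assumed: CXL-attached memory, CPU-less node */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    /* Allocate a buffer on the far (CXL) node, e.g. for cold data. */
    size_t len = 64 * 1024 * 1024;
    char *buf = numa_alloc_onnode(len, FAR_NODE);
    if (!buf)
        return 1;
    memset(buf, 0xA5, len);          /* touch pages so they are actually placed */

    /* Suppose profiling marks the first page as hot: migrate it to DRAM.
     * This is the 4 KiB-granularity, software-driven movement the slide
     * describes; hardware tiering avoids it entirely. */
    void *pages[1] = { buf };
    int nodes[1]   = { NEAR_NODE };
    int status[1]  = { -1 };
    long rc = numa_move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE);
    printf("move_pages rc=%ld, page now on node %d\n", rc, status[0]);

    numa_free(buf, len);
    return 0;
}
```

The page granularity and the profiling/telemetry needed to decide what is "hot" are exactly the overheads the slide lists for the software-assisted approach.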
7
Intel Xeon Roadmap Fully Aligned with CXL Roadmap
Intel CXL Enabling Strategy
*PoC: Proof of Concept
4th & 5th Gen Intel® Xeon® CPUs: Sapphire Rapids (SPR) / Emerald Rapids (EMR), Eagle Stream platform
▪ Supports CXL v1.1 spec
▪ Leadership in CXL ecosystem enablement

6th Gen Intel® Xeon® CPUs: Granite Rapids (GNR) / Sierra Forest (SRF), Birch Stream platform
▪ Supports CXL v2.0 spec
▪ Enhanced support for CXL memory
• Flat Memory Mode
• Memory pooling for PoC*

Future Gen Intel® Xeon® CPUs
▪ Support for CXL v3.x spec
8
CXL is emerging as the industry focal point for coherent IO
▪ August 2, 2022, Flash Memory Summit: the CXL Consortium and the OpenCAPI Consortium sign a letter of intent to transfer the OpenCAPI specification and assets to the CXL Consortium
▪ In February 2022, the CXL Consortium and the Gen-Z Consortium signed an agreement to transfer the Gen-Z specification and assets to the CXL Consortium
CXL Standard Firmly Entrenched
Compute Express Link™ and CXL™ Consortium are trademarks of the Compute Express Link Consortium.
[Figure: CXL Board of Directors member logos; an industry open standard for high-speed communications with 250+ member companies]
9
Summary
▪ Memory-intensive workloads dominate the computing landscape today
• Increasing memory capacity purely with CPU-attached DRAM is getting expensive
▪ The CXL protocol, running over the same existing PCIe links, allows the system memory footprint to be augmented at a lower cost
▪ The Intel Xeon® roadmap fully supports CXL starting with 4th Gen Xeon® (SPR) CPUs
• Intel CPUs offer unique hardware-based tiering modes which do NOT depend on the OS's data-movement capabilities
▪ The CXL protocol has full support from all major computing industry players
10
11
Backup
12
CXL Memory Expansion: Intel Flat Memory Mode

Plain memory tier (CXL memory as a second tier):
▪ Memory classification: second memory tier; total system address space = native DRAM + CXL memory
▪ CXL memory attributes: bandwidth and latency similar to direct-attach DDR, or lower bandwidth and higher latency vs. direct-attach DDR
▪ Software considerations: the OS version must support CXL memory as a next tier; performance managed by the OS with AutoNUMA, or by the workload/OS moving hot/cold pages (4 KB+) (a node-discovery sketch follows this slide)

Flat Memory Mode:
▪ Memory classification: single memory tier; total system address space = native DRAM + CXL memory
▪ CXL memory attributes: bandwidth and latency similar to direct-attach DDR, or lower bandwidth and higher latency vs. direct-attach DDR
▪ Software considerations: completely hardware-managed movement of data between the two tiers; no software (workload or OS) involvement; granularity of data movement is a cache line (64 B) → lower latency

[Diagram: host with DDRx memory channels (DDR5, LPDDR) plus a CXL Type-3 card / CXL memory expander (DDR, NVM) forming system memory, shown as a plain memory tier vs. Flat Memory Mode]
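For the second-tier case above, tiering-aware software first has to find which NUMA node the CXL memory landed on. The sketch below is a hedged illustration using libnuma: it flags CPU-less, memory-only nodes, which is how CXL expanders typically appear when enumerated as a separate tier. Exact node numbering and sizes are system-specific assumptions.

```c
/*
 * Hedged sketch: discover CPU-less, memory-only NUMA nodes (the way CXL
 * expanders typically appear when exposed as a second tier).
 * Build with: gcc find_cxl_nodes.c -lnuma
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    struct bitmask *cpus = numa_allocate_cpumask();
    int max_node = numa_max_node();

    for (int node = 0; node <= max_node; node++) {
        if (numa_node_to_cpus(node, cpus) < 0)
            continue;                        /* no topology info for this node */
        long long free_b;
        long long size_b = numa_node_size64(node, &free_b);
        int cpu_count = numa_bitmask_weight(cpus);
        printf("node %d: %lld MiB total, %d CPUs%s\n",
               node, size_b >> 20, cpu_count,
               cpu_count == 0 ? "  <-- memory-only (likely CXL tier)" : "");
    }

    numa_free_cpumask(cpus);
    return 0;
}
```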
13
Intel Flat Memory Mode
Special GNR/SRF CXL memory expansion mode:
▪ Both DRAM and far memory are exposed to the OS as combined physical memory
▪ Data resides in either DRAM or far memory (no replication)
▪ Hot data is swapped into DRAM one cache line at a time, not a whole 4 KB page
▪ Performance is very good due to the 1:1 near/far memory ratio

[Diagram: Flat Memory Mode (1:1 ratio): 512 GB DRAM + 512 GB far memory presented as a single pool of OS-visible memory]
Flat memory mode feature unique to GNR/SRF on BHS platform
14
Flat Memory Mode Performance Demo Test Configuration
Future Intel Xeon processor code-named “Granite Rapids”
Intel Flat Memory Mode
• Performance test:
  • SAP in-memory database HANA*
  • Online Analytical Processing (OLAP) workload measuring analytic queries
  • OS: SUSE Linux Enterprise Server (SLES) 15
• Insights:
  • 98% of the performance of using all native DDR5 memory
  • More than 80% of memory capacity (native DRAM + CXL memory) in use
  • Less than 4% miss rate: Intel Flat Memory Mode serves more than 96% of memory accesses from native DRAM, with hardware-managed tiering between native DRAM (DDR5) and CXL-attached DDR4 memory (an illustrative effective-latency model follows this slide)
[Test configurations, future Intel Xeon processor code-named "Granite Rapids": using only native DDR5 (256 GB DDR5 memory) vs. Intel Flat Memory Mode with 128 GB DDR5 memory + 128 GB CXL-attached DDR4 memory]
*Note: This is a performance test and not a support statement from SAP
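One way to read the numbers above (less than 4% misses, roughly 98% of all-DDR5 performance) is a simple effective-latency model. The sketch below is illustrative only: the hit fraction comes from the demo, but the 100 ns and 250 ns load-to-use latencies are assumptions rather than measured values.

```latex
% Effective latency with hit fraction h served from native DRAM (latencies assumed)
L_{\text{eff}} = h \, L_{\text{DRAM}} + (1-h)\, L_{\text{CXL}}
\approx 0.96 \times 100\,\text{ns} + 0.04 \times 250\,\text{ns}
\approx 106\,\text{ns}
```

That is only a few percent above an all-DRAM system, which is consistent with the ~2% performance loss reported on this slide.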
15
CXL Memory Bandwidth Expansion
Value prop: enable bandwidth-hungry workloads like ML; enable higher core counts

Interleaving modes:
▪ Interleave across CXL devices within a CXL memory region
▪ Hetero-interleave between the CPU's DDR5 and CXL memory (for bandwidth expansion only)
▪ System configuration chosen at boot time

CXL memory attributes: bandwidth sustained over the CXL link is similar to direct-attach DDR

Methods:
1) Completely hardware-based interleaving; no OS tiering capability required
2) Software-based (OS, middleware) page interleaving (a sketch follows this slide)

[Diagram: CPU with direct-attach DDR5 plus CXL memory in EDSFF E3 or E1, or PCIe CEM/custom board form factors]

HW-assisted hetero-interleaving is a feature unique to EMR and GNR/SRF
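Method (2) above, software-based page interleaving, can be expressed with libnuma's interleave policy. The sketch below is a hedged illustration: the node IDs (0 for native DDR5, 1 for CXL memory) and the buffer size are assumptions, and a real deployment would pick the nodes from topology discovery rather than hard-coding them.

```c
/*
 * Hedged sketch of software-based page interleaving: stripe an allocation's
 * pages across the DRAM node and the CXL memory node. Node IDs are assumed.
 * Build with: gcc interleave.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    /* Nodemask covering the DRAM node and the CXL node (assumed IDs). */
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 0);   /* assumed: CPU-attached DDR5 node */
    numa_bitmask_setbit(nodes, 1);   /* assumed: CXL memory (CPU-less)  */

    /* Pages of this buffer are placed round-robin across both nodes, so a
     * bandwidth-hungry workload draws from DDR5 and CXL in parallel. */
    size_t len = 256UL * 1024 * 1024;
    void *buf = numa_alloc_interleaved_subset(len, nodes);
    if (!buf) {
        numa_free_nodemask(nodes);
        return 1;
    }
    memset(buf, 0, len);             /* touch pages to realize the placement */

    printf("Allocated %zu MiB interleaved across nodes 0 and 1\n", len >> 20);

    numa_free(buf, len);
    numa_free_nodemask(nodes);
    return 0;
}
```

Intel's hardware hetero-interleave mode achieves similar striping transparently, configured at boot time, with no policy calls in the application or OS.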
16
H/w Assisted Hetero-Interleave Mode (EGS-EMR)
▪ Completely hardware-controlled tiering mode
• CXL memory recognized as a single NUMA node
▪ No page movements
▪ No dependence on OS-based tiering techniques
▪ System address space 'striped' across:
• 8 native DRAM channels (for 5th Gen Xeons)
• memory attached via 2 CXL links (≈ 4x DDR5 channels)
▪ Total = 12-way interleave (a rough headroom estimate follows this slide)
Results in higher system memory bandwidth
[Diagram: EMR CPU with 8x DDR5 channels of DDR5 DIMMs (8-way interleave) plus two x16 CXL 1.1 links to DDR5-on-buffer devices (2-way channel interleave behind each buffer, 4-way across the CXL region); UPI links to the other socket]
Intel's Hetero-Interleave mode is beneficial to bandwidth-hungry workloads like AI/ML, with no dependency on OS version or capability
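A rough way to bound the bandwidth headroom from the 12-way interleave above, assuming (as noted on an earlier slide) that each x16 CXL link sustains roughly the bandwidth of two DDR5 channels:

```latex
% Peak-bandwidth headroom of 12-way vs. 8-way interleave (channel-equivalent model)
\frac{BW_{12\text{-way}}}{BW_{8\text{-way}}}
\approx \frac{8\,\text{ch} + 2\,\text{links} \times 2\,\text{ch-equiv}}{8\,\text{ch}}
= \frac{12}{8} = 1.5
```

The 23% AI-inference speedup on the next slide sits well inside that theoretical ~50% ceiling, since real workloads are not purely memory-bandwidth-bound.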
17
▪ 23% speedup with hetero-mode (12-channel) CXL memory
▪ Hetero-mode memory BW utilization
• Read/write ratio: 2:1
[Chart: Bone Age Assessment performance speedup in hetero mode, throughput in fps (higher is better): native-only = 100% vs. 12-channel mode = 123%*]

[Figure: AI-based image-analysis inference pipeline for bone-age assessment, comprising a localization network, a heatmap network (key-points heatmap), and a regression network with a gender input, producing the bone-age assessment output]
*123% is using production CXL silicon. Demo is running pre-production
silicon that shows 112% speedup.
AI-based image analysis: EMR-based demo

Editor's Notes

  • #4 As you all know, PCIe has been the primary link connecting external devices to the CPU for almost two decades. The PCIe link has of course evolved and addressed the bandwidth needs and new features required by modern CPUs. So many of you probably wonder about the need to invent a new CXL protocol. The slide shows the challenges not quite addressed by the existing PCIe links between processors and devices. Under the hood, CXL includes three sub-protocols: the first (CXL.io) is exactly the same as the PCIe protocol, and the other two (CXL.mem and CXL.cache) are the new coherent protocols. However, this does not mean PCIe has no future. There are certain applications, such as those involving big-data block transfers, which are better addressed by PCIe than by CXL. But that is not the subject we will discuss today.
  • #6 We now have a way to add to the system memory using CXL links..
  • #9 The CXL Consortium got a big momentum boost in 2022 when both OpenCAPI (IBM's coherent link protocol) and Gen-Z (another coherent fabric protocol pushed by major OEMs) merged their assets with those of CXL. Later, another coherent link group (CCIX) followed suit. Today the CXL Consortium stands 250+ companies strong and is here to stay.
  • #13 These are the basic cases where the memory is directly attached locally on the CXL end-point and is pretty much captive to the local host. One could consider memory tiering to really be a special case of memory expansion, but here the software can play a bigger role in performing tiering optimizations, moving pages in and out of tiered memory based on application execution dynamics; this is especially so when a lower-bandwidth memory such as persistent memory is used. Now contrast this with Flat Memory Mode, a unique feature offered only on Intel CPUs starting with the GNR generation. In this mode the BIOS, which initially enumerates the system memory (native + CXL memory), presents it to the OS as only one NUMA node, i.e. one tier, so OS-based page movement is not invoked during system operation. Instead, the CPU hardware swaps cache lines with the CXL memory when a miss occurs in the DRAM. This is a quick 64-byte transfer, unlike the full 4 KB page in OS-based data movement, meaning the workload is stalled only briefly. Since this is hardware-controlled data movement, there is no dependence on the Linux version for a particular CXL page-movement capability; any Linux kernel that can detect CXL memory (as early as v5.1) will suffice. The OS is of course still needed for housekeeping tasks like launching applications, error handling, etc.
  • #14 I want to share a CXL memory expansion mode which is unique to the GNR/SRF family of CPUs, and to compare it with the other CXL memory expansion use cases we have discussed so far. Unlike the other memory expansion modes, where the additional memory resides in a separate tier and any cache line in that tier incurs a higher latency, in Flat (2LM) mode the fetched cache line is swapped with the corresponding cache line in near memory. Note that this is always possible since the ratio between the two memory sizes is 1:1. Flat Memory Mode offers even better performance than the Memory Mode of Intel Optane persistent memory, since there the ratio was 1:4, leading to more unrelated evictions when swapping was done. Of course, one has to provision the same amount of memory on the CXL side. Flat Memory Mode's value prop is the big memory TCO reduction. A big emerging use case is the re-use of older generations of DDR memory which would otherwise have been recycled, so I term this advantage '2 for the price of 1'.
  • #15 Testing was done on the SAP HANA database on SUSE Linux Enterprise Server (SLES) 15 with OS kernel 5.19, GNR X3 A-stepping 66C, BIOS version BHSDCRB1.SYS.2526.D01.2307311547, with DDR5 and CXL DDR4 x8 memory. Even though we are using DDR4 memory on the CXL side, the hit to performance is only 2%. Keep in mind that Flat MM is a TCO play, not a performance-improvement play; we want to show that the performance impact is minimal in spite of using cheaper, lower-bandwidth memory on the CXL side.
  • #16 The other big use case of CXL-attached memory is bandwidth expansion. Today we already use memory interleaving when accessing local DRAM by simultaneously accessing all available channels: instead of getting a big chunk of data out of just one DRAM channel sequentially, the chunk is split across all available 8 DRAM channels and multiple smaller chunks are fetched simultaneously. With CXL-attached memory, one can extend this idea by adding CXL memory to the above interleaving scheme. And all this happens completely under the hood, meaning the CPU hardware can be configured at boot time to access memory in this manner without the software having to do any addressing tricks. This feature is unique to Intel CPUs and is supported on both Eagle Stream and Birch Stream platforms.