Intel Compute Express Link™
Enablement
Anil Godbole, Intel
Feb 2024
2
1 Source: Intel. Results may vary.
2 Source: https://flashmemorysummit.com/English/Collaterals/Proceedings/2018/20180809_NEWM-301A-1_Gervasi.pdf
3 Source: Intel Internal. Estimates – based on large scale deployments and does not include software costs.
[Chart: Current cost percentage of server memory compared to other components]

[Chart: DRAM Density Over Time² (1 MB in 1985 to 32 GB projected by 2025): scaling of DRAM density is slowing, from ~4x every 3 years, to ~2x every 3 years, to a projected ~2x every 4 years]

[Chart: CPU Core Growth Projection Over Time¹ (2017-2023): compute performance growth is accelerating, with exponential CPU core growth]
Memory density and costs are not keeping pace with data center workload and infrastructure cost requirements
Popular memory-intensive workloads:
▪ AI/ML with LLMs
▪ Databases & analytics
▪ Web-caching apps
▪ Content delivery networks
▪ Virtual desktop infrastructure

Today, memory costs dominate the server's BOM.
3
CXL Allows Addition of Memory to a Server
Copyright CXL Consortium 2021
4
CXL on Motherboards: Same slot for PCIe OR CXL
▪ Starting with 4th Gen Intel® Xeon® (Sapphire Rapids, SPR) processors
• Flexible port configured for PCIe or CXL during link-up
[Diagram: Intel® Xeon® host with a mux feeding a PCIe x16 connector/slot at 32 GT/s; during link-up the same slot trains as either a x16 CXL device or a x16 PCIe device. Photo: PCIe slots on an Intel Archer City board]
5
Augment System Memory with CXL
CPU-attached DRAM (native DDR5):
▪ Expensive to add more DRAM channels to the CPU package
▪ Memory capacity expansion with 2 DIMMs per channel (2 DPC) often causes a drop in total memory bandwidth (DDR5-5600 → DDR5-4800)
▪ Has the lowest memory latency

CXL-attached memory (EDSFF E3 or E1, or PCIe CEM/custom board):
▪ Cheaper to add CXL channels to the CPU package
• 66 pins for one x16 CXL link vs. 250 pins for two DDR5 channels
• Note: bandwidth of one x16 CXL link ≈ two DDR5 channels (see the back-of-the-envelope check below)
▪ Allows bandwidth expansion irrespective of the DRAM configuration on the CXL memory buffer
▪ Reduces TCO by re-using older DDR4 memory or by using cheaper, lower-bandwidth memory such as NVM
▪ Has higher latency than CPU-attached DRAM
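As a rough sanity check on the note above that one x16 CXL link is roughly equal to two DDR5 channels in bandwidth, here is a back-of-the-envelope comparison. The DDR5-5600 speed bin is taken from the bullet above, and protocol/encoding overheads on both interfaces are ignored, so treat these as ballpark figures only.

```latex
% Back-of-the-envelope comparison (assumed speed bins; protocol overheads ignored)
\begin{align*}
\text{x16 CXL link @ 32 GT/s:}\quad & 16 \times 32\,\text{Gb/s} \approx 64\,\text{GB/s per direction}
  \approx 128\,\text{GB/s combined (read + write)}\\
\text{One DDR5-5600 channel:}\quad & 5600\,\text{MT/s} \times 8\,\text{B} \approx 44.8\,\text{GB/s}\\
\text{Two DDR5-5600 channels:}\quad & 2 \times 44.8\,\text{GB/s} \approx 90\,\text{GB/s}
\end{align*}
```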
6
CXL Memory Tiering

CXL-based memory addition → memory tiers (NUMA nodes): native DRAM ("near memory") / CXL memory ("far memory")

Software-assisted (hypervisor/OS/app) memory tiering mechanism:
▪ Software performs hot/cold page movement (a sketch follows this slide)
▪ Larger-granularity (4 KB+) transfers
▪ Tracking/telemetry overheads

Hardware-controlled memory tiering, two options:
(1) Interleave the DRAM and CXL memory address space
▪ System memory & bandwidth expansion
▪ Lowers average latency
(2) Intel Flat Memory Mode (on BHS)
▪ System memory expansion
▪ TCO reduction

Intel's HW-controlled tiering features are unique to Intel Xeon CPUs; the system boots as a single NUMA node and provides OS-version-agnostic performance gains.
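To make the software-assisted path above concrete, here is a minimal sketch using libnuma, assuming the CXL expander shows up as a CPU-less NUMA node. The node IDs (0 for near DRAM, 1 for CXL far memory) and the "hot page" choice are illustrative assumptions, not from the slides; this is the OS/application-driven movement that the hardware-controlled modes avoid.

```c
/*
 * Hedged sketch of software-assisted tiering: treat the CXL expander as a
 * CPU-less "far" NUMA node and promote a hot page to the "near" DRAM node.
 * Node IDs are assumptions for illustration; query them with
 * numactl --hardware on a real system. Build with: gcc demo.c -lnuma
 */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NEAR_NODE 0   /* assumed: CPU-attached DDR5 */
#define FAR_NODE  1   /* assumed: CXL-attached memory, CPU-less node */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    /* Allocate a buffer on the far (CXL) node, e.g. for cold data. */
    size_t len = 64 * 1024 * 1024;
    char *buf = numa_alloc_onnode(len, FAR_NODE);
    if (!buf)
        return 1;
    memset(buf, 0xA5, len);          /* touch pages so they are actually placed */

    /* Suppose profiling marks the first page as hot: migrate it to DRAM.
     * This is the 4 KiB-granularity, software-driven movement the slide
     * describes; hardware tiering avoids it entirely. */
    void *pages[1] = { buf };
    int nodes[1]   = { NEAR_NODE };
    int status[1]  = { -1 };
    long rc = numa_move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE);
    printf("move_pages rc=%ld, page now on node %d\n", rc, status[0]);

    numa_free(buf, len);
    return 0;
}
```

The page granularity and the profiling/telemetry needed to decide what is "hot" are exactly the overheads the slide lists for the software-assisted approach.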
7
Intel Xeon Roadmap Fully Aligned with CXL Roadmap
Intel CXL Enabling Strategy
*PoC: Proof of Concept
4th & 5th Gen Intel® Xeon® CPUs: Sapphire Rapids (SPR) / Emerald Rapids (EMR), Eagle Stream platform
▪ Supports CXL v1.1 spec
▪ Leadership in CXL ecosystem enablement

6th Gen Intel® Xeon® CPUs: Granite Rapids (GNR) / Sierra Forest (SRF), Birch Stream platform
▪ Supports CXL v2.0 spec
▪ Enhanced support for CXL memory
• Flat Memory Mode
• Memory pooling for PoC*

Future Gen Intel® Xeon® CPUs
▪ Support for CXL v3.x spec
8
CXL is emerging as the industry focal point for coherent IO
▪ August 2, 2022, Flash Memory Summit: the CXL Consortium and the OpenCAPI Consortium sign a letter of intent to transfer the OpenCAPI specification and assets to the CXL Consortium
▪ In February 2022, the CXL Consortium and the Gen-Z Consortium signed an agreement to transfer the Gen-Z specification and assets to the CXL Consortium
CXL Standard Firmly Entrenched
Compute Express Link™ and CXL™ Consortium are trademarks of the Compute Express Link Consortium.
[Figure: CXL Board of Directors member logos; an industry open standard for high-speed communications with 250+ member companies]
9
Summary
▪ Memory-intensive workloads dominate the computing landscape today
• Increasing memory capacity purely with CPU-attached DRAM is getting expensive
▪ The CXL protocol, running over the same existing PCIe links, allows the system memory footprint to be augmented at a lower cost
▪ The Intel Xeon® roadmap fully supports CXL starting with 4th Gen Xeon® (SPR) CPUs
• Intel CPUs offer unique hardware-based tiering modes which do NOT depend on the OS's data-movement capabilities
▪ The CXL protocol has full support from all major computing industry players
10
11
Backup
12
CXL Memory Expansion: Intel Flat Memory Mode

Plain memory tier (CXL memory as a second tier):
▪ Memory classification: second memory tier; total system address space = native DRAM + CXL memory
▪ CXL memory attributes: bandwidth and latency similar to direct-attach DDR, or lower bandwidth and higher latency vs. direct-attach DDR
▪ Software considerations: the OS version must support CXL memory as a next tier; performance managed by the OS with AutoNUMA, or by the workload/OS moving hot/cold pages (4 KB+) (a node-discovery sketch follows this slide)

Flat Memory Mode:
▪ Memory classification: single memory tier; total system address space = native DRAM + CXL memory
▪ CXL memory attributes: bandwidth and latency similar to direct-attach DDR, or lower bandwidth and higher latency vs. direct-attach DDR
▪ Software considerations: completely hardware-managed movement of data between the two tiers; no software (workload or OS) involvement; granularity of data movement is a cache line (64 B) → lower latency

[Diagram: host with DDRx memory channels (DDR5, LPDDR) plus a CXL Type-3 card / CXL memory expander (DDR, NVM) forming system memory, shown as a plain memory tier vs. Flat Memory Mode]
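For the second-tier case above, tiering-aware software first has to find which NUMA node the CXL memory landed on. The sketch below is a hedged illustration using libnuma: it flags CPU-less, memory-only nodes, which is how CXL expanders typically appear when enumerated as a separate tier. Exact node numbering and sizes are system-specific assumptions.

```c
/*
 * Hedged sketch: discover CPU-less, memory-only NUMA nodes (the way CXL
 * expanders typically appear when exposed as a second tier).
 * Build with: gcc find_cxl_nodes.c -lnuma
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    struct bitmask *cpus = numa_allocate_cpumask();
    int max_node = numa_max_node();

    for (int node = 0; node <= max_node; node++) {
        if (numa_node_to_cpus(node, cpus) < 0)
            continue;                        /* no topology info for this node */
        long long free_b;
        long long size_b = numa_node_size64(node, &free_b);
        int cpu_count = numa_bitmask_weight(cpus);
        printf("node %d: %lld MiB total, %d CPUs%s\n",
               node, size_b >> 20, cpu_count,
               cpu_count == 0 ? "  <-- memory-only (likely CXL tier)" : "");
    }

    numa_free_cpumask(cpus);
    return 0;
}
```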
13
Intel Flat Memory Mode
Special GNR/SRF CXL memory expansion mode:
▪ Both DRAM and far memory are exposed to the OS as combined physical memory
▪ Data resides in either DRAM or far memory (no replication)
▪ Hot data is swapped into DRAM one cache line at a time, not a whole 4 KB page
▪ Performance is very good due to the 1:1 near/far memory ratio

[Diagram: Flat Memory Mode (1:1 ratio): 512 GB DRAM + 512 GB far memory presented as a single pool of OS-visible memory]
Flat memory mode feature unique to GNR/SRF on BHS platform
14
Flat Memory Mode Performance Demo Test Configuration
Future Intel Xeon processor code-named “Granite Rapids”
Intel Flat Memory Mode
• Performance test:
  • SAP in-memory database HANA*
  • Online Analytical Processing (OLAP) workload measuring analytic queries
  • OS: SUSE Linux Enterprise Server (SLES) 15
• Insights:
  • 98% of the performance of using all native DDR5 memory
  • More than 80% of memory capacity (native DRAM + CXL memory) in use
  • Less than 4% miss rate: Intel Flat Memory Mode serves more than 96% of memory accesses from native DRAM, with hardware-managed tiering between native DRAM (DDR5) and CXL-attached DDR4 memory (an illustrative effective-latency model follows this slide)
[Test configurations, future Intel Xeon processor code-named "Granite Rapids": using only native DDR5 (256 GB DDR5 memory) vs. Intel Flat Memory Mode with 128 GB DDR5 memory + 128 GB CXL-attached DDR4 memory]
*Note: This is a performance test and not a support statement from SAP
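One way to read the numbers above (less than 4% misses, roughly 98% of all-DDR5 performance) is a simple effective-latency model. The sketch below is illustrative only: the hit fraction comes from the demo, but the 100 ns and 250 ns load-to-use latencies are assumptions rather than measured values.

```latex
% Effective latency with hit fraction h served from native DRAM (latencies assumed)
L_{\text{eff}} = h \, L_{\text{DRAM}} + (1-h)\, L_{\text{CXL}}
\approx 0.96 \times 100\,\text{ns} + 0.04 \times 250\,\text{ns}
\approx 106\,\text{ns}
```

That is only a few percent above an all-DRAM system, which is consistent with the ~2% performance loss reported on this slide.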
15
CXL Memory Bandwidth Expansion
Value prop: enable bandwidth-hungry workloads like ML; enable higher core counts

Interleaving modes:
▪ Interleave across CXL devices within a CXL memory region
▪ Hetero-interleave between the CPU's DDR5 and CXL memory (for bandwidth expansion only)
▪ System configuration chosen at boot time

CXL memory attributes: bandwidth sustained over the CXL link is similar to direct-attach DDR

Methods:
1) Completely hardware-based interleaving; no OS tiering capability required
2) Software-based (OS, middleware) page interleaving (a sketch follows this slide)

[Diagram: CPU with direct-attach DDR5 plus CXL memory in EDSFF E3 or E1, or PCIe CEM/custom board form factors]

HW-assisted hetero-interleaving is a feature unique to EMR and GNR/SRF
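Method (2) above, software-based page interleaving, can be expressed with libnuma's interleave policy. The sketch below is a hedged illustration: the node IDs (0 for native DDR5, 1 for CXL memory) and the buffer size are assumptions, and a real deployment would pick the nodes from topology discovery rather than hard-coding them.

```c
/*
 * Hedged sketch of software-based page interleaving: stripe an allocation's
 * pages across the DRAM node and the CXL memory node. Node IDs are assumed.
 * Build with: gcc interleave.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    /* Nodemask covering the DRAM node and the CXL node (assumed IDs). */
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 0);   /* assumed: CPU-attached DDR5 node */
    numa_bitmask_setbit(nodes, 1);   /* assumed: CXL memory (CPU-less)  */

    /* Pages of this buffer are placed round-robin across both nodes, so a
     * bandwidth-hungry workload draws from DDR5 and CXL in parallel. */
    size_t len = 256UL * 1024 * 1024;
    void *buf = numa_alloc_interleaved_subset(len, nodes);
    if (!buf) {
        numa_free_nodemask(nodes);
        return 1;
    }
    memset(buf, 0, len);             /* touch pages to realize the placement */

    printf("Allocated %zu MiB interleaved across nodes 0 and 1\n", len >> 20);

    numa_free(buf, len);
    numa_free_nodemask(nodes);
    return 0;
}
```

Intel's hardware hetero-interleave mode achieves similar striping transparently, configured at boot time, with no policy calls in the application or OS.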
16
H/w Assisted Hetero-Interleave Mode (EGS-EMR)
▪ Completely hardware-controlled tiering mode
• CXL memory recognized as a single NUMA node
▪ No page movements
▪ No dependence on OS-based tiering techniques
▪ System address space 'striped' across:
• 8 native DRAM channels (for 5th Gen Xeons)
• memory attached via 2 CXL links (≈ 4x DDR5 channels)
▪ Total = 12-way interleave (a rough headroom estimate follows this slide)
Results in higher system memory bandwidth
[Diagram: EMR CPU with 8x DDR5 channels of DDR5 DIMMs (8-way interleave) plus two x16 CXL 1.1 links to DDR5-on-buffer devices (2-way channel interleave behind each buffer, 4-way across the CXL region); UPI links to the other socket]
Intel's Hetero-Interleave mode is beneficial to bandwidth-hungry workloads like AI/ML, with no dependency on OS version or capability
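A rough way to bound the bandwidth headroom from the 12-way interleave above, assuming (as noted on an earlier slide) that each x16 CXL link sustains roughly the bandwidth of two DDR5 channels:

```latex
% Peak-bandwidth headroom of 12-way vs. 8-way interleave (channel-equivalent model)
\frac{BW_{12\text{-way}}}{BW_{8\text{-way}}}
\approx \frac{8\,\text{ch} + 2\,\text{links} \times 2\,\text{ch-equiv}}{8\,\text{ch}}
= \frac{12}{8} = 1.5
```

The 23% AI-inference speedup on the next slide sits well inside that theoretical ~50% ceiling, since real workloads are not purely memory-bandwidth-bound.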
17
▪ 23% speedup with hetero-mode (12-channel) CXL memory
▪ Hetero-mode memory BW utilization
• Read/write ratio: 2:1
[Chart: Bone Age Assessment performance speedup in hetero mode, throughput in fps (higher is better): native-only = 100% vs. 12-channel mode = 123%*]

[Figure: AI-based image-analysis inference pipeline for bone-age assessment, comprising a localization network, a heatmap network (key-points heatmap), and a regression network with a gender input, producing the bone-age assessment output]
*123% is using production CXL silicon. Demo is running pre-production
silicon that shows 112% speedup.
AI-based image analysis: EMR-based demo

Editor's Notes

  • #4 As you all know, PCIe has been the primary link connecting external devices to the CPU for almost two decades. The PCIe link has of course evolved and addressed the bandwidth needs and new features required by modern CPUs. So many of you probably wonder about the need to invent a new CXL protocol. The slide shows the challenges not quite addressed by the existing PCIe links between processors and devices. Under the hood, CXL includes three sub-protocols: the first (CXL.io) is exactly the same as the PCIe protocol, and the other two (CXL.mem and CXL.cache) are the new coherent protocols. However, this does not mean PCIe has no future. There are certain applications, such as those involving big-data block transfers, which are better addressed by PCIe than by CXL. But that is not the subject we will discuss today.
  • #6 We now have a way to add to the system memory using CXL links..
  • #9 The CXL Consortium got a big momentum boost in 2022 when both OpenCAPI (IBM's coherent link protocol) and Gen-Z (another coherent fabric protocol pushed by major OEMs) merged their assets with those of CXL. Later, another coherent link group (CCIX) followed suit. Today the CXL Consortium stands 250+ companies strong and is here to stay.
  • #13 These are the basic cases where the memory is directly attached locally on the CXL end-point and is pretty much captive to the local host. One could consider memory tiering to really be a special case of memory expansion, but here the software can play a bigger role in performing tiering optimizations, moving pages in and out of tiered memory based on application execution dynamics; this is especially so when a lower-bandwidth memory such as persistent memory is used. Now contrast this with Flat Memory Mode, a unique feature offered only on Intel CPUs starting with the GNR generation. In this mode the BIOS, which initially enumerates the system memory (native + CXL memory), presents it to the OS as only one NUMA node, i.e. one tier, so OS-based page movement is not invoked during system operation. Instead, the CPU hardware swaps cache lines with the CXL memory when a miss occurs in the DRAM. This is a quick 64-byte transfer, unlike the full 4 KB page in OS-based data movement, meaning the workload is stalled only briefly. Since this is hardware-controlled data movement, there is no dependence on the Linux version for a particular CXL page-movement capability; any Linux kernel that can detect CXL memory (as early as v5.1) will suffice. The OS is of course still needed for housekeeping tasks like launching applications, error handling, etc.
  • #14 I want to share a CXL memory expansion mode which is unique to the GNR/SRF family of CPUs, and to compare it with the other CXL memory expansion use cases we have discussed so far. Unlike the other memory expansion modes, where the additional memory resides in a separate tier and any cache line in that tier incurs a higher latency, in Flat (2LM) mode the fetched cache line is swapped with the corresponding cache line in near memory. Note that this is always possible since the ratio between the two memory sizes is 1:1. Flat Memory Mode offers even better performance than the Memory Mode of Intel Optane persistent memory, since there the ratio was 1:4, leading to more unrelated evictions when swapping was done. Of course, one has to provision the same amount of memory on the CXL side. Flat Memory Mode's value prop is the big memory TCO reduction. A big emerging use case is the re-use of older generations of DDR memory which would otherwise have been recycled, so I term this advantage '2 for the price of 1'.
  • #15 Testing was done on the SAP HANA database on SUSE Linux Enterprise Server (SLES) 15 with OS kernel 5.19, GNR X3 A-stepping 66C, BIOS version BHSDCRB1.SYS.2526.D01.2307311547, with DDR5 and CXL DDR4 x8 memory. Even though we are using DDR4 memory on the CXL side, the hit to performance is only 2%. Keep in mind that Flat MM is a TCO play, not a performance-improvement play; we want to show that the performance impact is minimal in spite of using cheaper, lower-bandwidth memory on the CXL side.
  • #16 The other big use case of CXL-attached memory is bandwidth expansion. Today we already use memory interleaving when accessing local DRAM by simultaneously accessing all available channels: instead of getting a big chunk of data out of just one DRAM channel sequentially, the chunk is split across all available 8 DRAM channels and multiple smaller chunks are fetched simultaneously. With CXL-attached memory, one can extend this idea by adding CXL memory to the above interleaving scheme. And all this happens completely under the hood, meaning the CPU hardware can be configured at boot time to access memory in this manner without the software having to do any addressing tricks. This feature is unique to Intel CPUs and is supported on both Eagle Stream and Birch Stream platforms.