During the CXL Forum at the OCP Global Summit, Michael Ocampo of Astera Labs explained the memory wall problem and how CXL memory powered by Astera Labs can break through it.
2. Breaking Through the Memory Wall with CXL™
Michael Ocampo, Sr. Product Manager, Ecosystem/Cloud-Scale Interop Lab
3. Agenda
• Breaking Through the Memory Wall
• Memory Bound Use Cases
• CXL for Modular Shared Infrastructure
• Critical CXL Collaboration Happening Now
• Call to Action
4. Breaking Through the Memory Wall
Challenges with Previous Attempts
1. Memory BW and capacity did not scale efficiently
2. Latency inferior to local CPU memory
3. Not deployable at scale
4. Not easily adopted by existing applications
Breaking Through the Memory Wall with CXL
1. Increase server memory BW and capacity by 50%
2. Reduce latency by 25%
3. Standard DRAM for flexible supply chain and cost
4. Seamlessly expand memory for existing and new applications
5. Memory Bound Use Cases
• eCommerce & Business Intelligence
• Online Transaction Processing
• Online Analytics Processing
• AI Inferencing
• Recommendation Engines
• Semantic Cache
[Diagram: OLTP answers "What is happening?"; OLAP answers "What has happened?" An opportunity for CXL to boost MySQL database performance.]
[Diagram: Users send queries to an inference server backed by REST models; the server queries and stores results in a vector database. An opportunity for CXL to boost vector database performance.]
6. OLTP & OLAP Results
[Chart: Normalized TPC-H query times for Q1 through Q22 and the average, comparing DRAM+CXL against DRAM-only.]
System Under Test Configuration (OLAP)
• CPU: 5th Gen Intel® Xeon® Scalable Processor (Single-Socket)
• Storage: 4x NVMe PCIe 4.0 SSDs
• Local Memory: 512GB (8x 64GB DDR5-5600)
• CXL-Attached Memory: 256GB (4x 64GB DDR5-5600)
• Mode: 12-Way Heterogeneous Interleaving
• Benchmark: TPC-H (1000 scale factor)
Cut Big Query Times in Half with CXL Memory
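The "12-way" mode presumably interleaves across the eight local and four CXL-attached DIMMs at the platform level. For flavor only, here is a minimal user-space sketch of interleaved allocation using libnuma via ctypes; it assumes the CXL memory has been onlined as a NUMA node, and it is not the mechanism used in this benchmark:

```python
import ctypes

# Minimal sketch (assumption: CXL memory is onlined as a NUMA node).
# Allocates a buffer whose pages are interleaved round-robin across all
# NUMA nodes, local DRAM and CXL-attached memory alike.
numa = ctypes.CDLL("libnuma.so.1", use_errno=True)

if numa.numa_available() < 0:
    raise RuntimeError("NUMA is not available on this system")

size = 1 << 30  # 1 GiB
numa.numa_alloc_interleaved.restype = ctypes.c_void_p
buf = numa.numa_alloc_interleaved(ctypes.c_size_t(size))
if not buf:
    raise MemoryError("interleaved allocation failed")
# ... touch/use the buffer; pages fault in across nodes ...
numa.numa_free(ctypes.c_void_p(buf), ctypes.c_size_t(size))
```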
System Under Test Configuration (OLTP)
• CPU: 4th Gen Intel® Xeon® Scalable Processor (Single-Socket)
• Storage: 2x NVMe PCIe 4.0 SSDs
• Local Memory: 128GB (8x 16GB DDR5-4800)
• CXL-Attached Memory: 128GB (2x 64GB DDR5-5600)
• Mode: Memory Tiering (MemVerge Memory Machine)
• Benchmark: Sysbench (Percona Labs TPC-C Scripts)
150% More TPS with only 15% More CPU Utilization
[Charts: Normalized transactions per second (TPS) and CPU utilization versus client count (0 to 1000), comparing DRAM against DRAM+CXL.]
7. Breaking the Memory Wall for Databases
Without CXL: Popular Certified & Supported SAP HANA® Hardware. 48 DIMMs across two 2-socket systems (DDR5-4800); high kW, high TCO.
With CXL: Optimized Hardware for In-Memory Databases. 56 DIMMs in one 2-socket system (DDR5-4800), interleaving across CXL-attached memory over eight x16 links; lower kW, lower TCO.
2.33x memory capacity and 1.66x memory bandwidth per socket with CXL. Lower TCO for memory-intensive applications.
8. Modular Shared Infrastructure (M-SIF)
• OCP Alignment with M-SIF and CMS:
  • Shared Elements with CXL Support
  • Pluggable Multi-Purpose Module (PMM)
  • 1U Support for High-Density
  • Standard DIMM Support
  • High Power Connector (200W-600W)
  • Hot-plug Support
• Challenges:
  • Signal Integrity
  • Link Bifurcation & Configuration
  • Latency/Performance
  • DIMM Interoperability
[Diagram: A core element with 16-24 DIMMs per host and a shared element with 8-16 DIMMs per host interface board (HIB); JBOF/JBOG/JBOM nodes 1-5 wired logically within the enclosure over a PWR/PCIe/CXL backplane, with CXL controllers and PCIe/CXL retimers carrying CXL.mem between the host CPU and the PMM(s).]
9. Enabling CXL Connectivity for M-SIF
• Local, Direct CXL-Attached (MB / PCI CEM connectivity). Use case: memory expansion for real-time apps.
• Short Reach, CXL-Attached (midplane or backplane connectivity through a PCIe retimer). Use case: JBOM enablement with intelligent tiering/placement.
• Long Reach, CXL-Attached (PCIe cabling connectivity through a PCIe retimer). Use case: shared/pooled memory for high-capacity in-memory compute.
[Diagram: Host CPUs reaching CXL memory through Leo CXL Memory Connectivity controllers, either directly or through Aries PCIe/CXL Smart Retimers over a backplane or PCIe cabling.]
10. Extending Reach for PCIe/CXL Memory
[Chart: Relative MLC bandwidth and latency for CXL alone, CXL + 1 retimer, and CXL + 2 retimers.]
Minimal Impact to Performance with Extended Reach
11. Critical Collaboration Happening Now
• DIMM Stability & Performance
• OS Development & Feature Testing
• CXL 2.0 RAS & Telemetry
• SW Integration & Orchestration
• High Performance Interleaving
• High-Capacity Memory Density Tiering
[Diagram: Breaking Through the Memory Wall, built on Host & DDRx Interop and Cloud-Scale Fleet Management.]
12. Call to Action
Visit: Check out how we smashed through the memory wall [OCP Map and where we are]
Learn More
www.opencompute.org
• wiki/server/DC-MHS/M-SIF Base Spec: 0.5
• wiki/server/CMS: CMM Proposal
• wiki/hardware_management/DC-SCM: 2.0
CXL Resources
• Linux: https://pmem.io/ndctl/collab
• Interop: https://www.asteralabs.com/interop
Ecosystem Alliance Contact:
• michael.ocampo@asteralabs.com
See the Demos
Booth A11
Ethernet
Scale memory bandwidth and capacity
Scale parallel processing of accelerators
Scale rack connectivity over copper
Peak compute has increased by 60,000x over the past 20 years, as opposed to 100x for DRAM or 30x for interconnect bandwidth.
Adopting emerging memory technologies that try to address this problem takes a long time; they must also scale and be manufactured reliably.
With ours, we’re using standard DRAM…
eCommerce & Data Warehousing
Boosts OLTP throughput by 2.5x
Cuts OLAP query times in half
Caching LLM prompts is a great way to reduce expenses. It works by using vector search to identify similar prompts and returning the cached response. If there are no similar responses, the request is passed on to the LLM provider to generate the completion.
Vector DBs don't need to be bound to the inference server; they can be decoupled from its process life cycle, enabling multiple instances to leverage the same cache. After receiving a query, the inference server (a minimal sketch follows the list):
1. Computes a hash of the input query, including the tensor and some metadata. This becomes the inference key.
2. Checks Redis for a previously run inference.
3. Returns that inference, if it exists.
4. Runs the tensor through the model if the inference does not exist.
5. Stores the inference in the vector DB.
6. Returns the inference.
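A minimal sketch of steps 1 through 6, assuming a local Redis instance and the redis-py client; the model callable and NumPy tensor are hypothetical stand-ins, and a production semantic cache would use vector similarity search rather than this exact-match hash lookup:

```python
import hashlib
import pickle

import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

def cached_inference(tensor: np.ndarray, metadata: dict, model):
    # 1. Hash the tensor plus metadata to form the inference key.
    key = hashlib.sha256(
        pickle.dumps((tensor.tobytes(), sorted(metadata.items())))
    ).hexdigest()
    # 2. Check Redis for a previously run inference.
    cached = r.get(key)
    if cached is not None:
        # 3. Return that inference, if it exists.
        return pickle.loads(cached)
    # 4. Run the tensor through the model on a cache miss.
    result = model(tensor)
    # 5. Store the inference so other server instances can reuse it.
    r.set(key, pickle.dumps(result))
    # 6. Return the inference.
    return result
```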
12 DIMMs per socket vs 28 DIMMs per socket
6 channels per socket vs 16 channels per socket
RAM per Core is roughly the same.
48 DIMMs * 256GB = 12.2TB (~76GB per core); 56 DIMMs * 256GB = 14.3TB (~75GB per core)
Ice Lake supports 40 cores per socket. 4 socket = 160 cores. Genoa supports 96 cores. 2 socket = 192 cores.
PCIe 4.0 system: Lenovo ThinkSystem SR860 V2
PCIe 5.0 system: ThinkSystem SR675 V3
M-SIF: Shared Infrastructure: Improves interoperability for shared-infrastructure enclosures with multiple serviceable modules. Modules containing elements are blind-matable and hot-pluggable into the shared infrastructure enclosure.
PMM: 167 mm long x 125 mm wide x 18.30 mm thick form factor (6.5" long x 4.9" wide x 0.72" thick)
Backplane Specification: OIF CEI-28G-LR: Delivers low crosstalk noise and low insertion loss while minimizing channel performance variation for every differential pair. Enables a scalable migration path beyond 25Gb/s.
At Astera, we’ve launched a Cloud-Scale Interop Lab for our retimer and CXL Type 3 device. Not only are we testing with all major CPU and memory vendors, but OS vendors as well.
We believe seamless interoperability is critical for ecosystem enablement, and there is a myriad of host systems and DDR5-4800 and -5600 modules that we support with our Leo Memory Connectivity Platform. Jonathan Zhang, Sr. Director of Software Engineering at Astera Labs, presented some of the challenges of DDR5 interop testing yesterday in the CMS track, but I'll just mention here that it takes a tremendous engineering effort to continuously run regression tests on as many DIMMs as possible. Of course, we "don't want to confuse the work behind the work with delivering actual customer value," so it's important to continue collaborating across the ecosystem to optimize benchmarks with different HW and SW configurations to understand what works best per use case and customer scenario.

RAS and telemetry are also vital for the usability of the technology. These features are essential pre- and post-OS, and we're working diligently with customers and partners to support software development and integration to configure and control CXL memory, as well as log events and alert the host as needed. This is really a continuum that supports not just orchestration for cloud-scale fleet management, but managing memory tiers and hypervisors as well. Thanks to community efforts through the CXL Linux Sync, we're seeing significant progress on core infrastructure services such as media management, error handling, and CXL performance monitoring.
Through this foundational work, we're seeing fruitful collaborations with all OS vendors to support applications on bare metal, VMs, and containers, as well as back-end applications such as databases, without major software changes or provisioning scripts.
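One reason little application change is needed: once onlined, CXL-attached memory commonly appears to Linux as a CPU-less NUMA node. A minimal sketch, assuming a Linux host exposing the standard sysfs NUMA topology (this heuristic is a general Linux convention, not an Astera Labs tool):

```python
# Minimal sketch: list NUMA nodes that have memory but no CPUs, which is
# how CXL-attached expansion memory commonly appears once it is onlined.
from pathlib import Path

def cpuless_memory_nodes():
    nodes = []
    for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        cpulist = (node / "cpulist").read_text().strip()
        if not cpulist:  # no CPUs on this node: candidate CXL/expansion memory
            nodes.append(node.name)
    return nodes

if __name__ == "__main__":
    print("CPU-less memory nodes:", cpuless_memory_nodes())
```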
AI Inferencing
Boosts ResNet-50 IPS by 1.5x
Accelerates Vector Search by 2x