MemVerge CEO Charles Fan describes why memory-hungry generative AI is a driver for CXL technology, the new computing model for AI, and MemVerge software for CXL and AI.
Q1 Memory Fabric Forum: Compute Express Link (CXL) 3.1 Update - Memory Fabric Forum
OCP Steering Committee member and ex-President of the CXL Consortium, Siamak Tavallaei, provides an update on the CXL specifications with a focus on the recently released 3.1 specification.
Q1 Memory Fabric Forum: Memory Processor Interface 2023, Focus on CXL - Memory Fabric Forum
Thibault Grossi, Sr. Technology & Market Analyst, shares excerpts from the recently published report, Memory Processor Interface, Focus on CXL. The report provides a taxonomy of CXL market segments and revenue forecasts through 2028.
Torry Steed, Sr. Product Marketing Manager at SMART Modular, provides an overview of CXL PCIe Add-in Cards (AICs) and memory modules that can be used to expand capacity in servers or in external memory pooling systems.
Q1 Memory Fabric Forum: Intel Enabling Compute Express Link (CXL) - Memory Fabric Forum
- Memory-intensive workloads are dominating computing, and increasing memory capacity with CPU-attached DRAM alone is getting expensive.
- CXL allows augmenting system memory footprint at lower cost by running over existing PCIe links to add memory outside of the CPU package.
- Intel Xeon roadmap fully supports CXL starting with 5th Gen Xeons, and Intel CPUs offer unique hardware-based tiering modes between native DRAM and CXL memory without depending on the operating system.
- CXL has full industry support as the standard for coherent input/output.
During the CXL Forum at OCP Global Summit, memory system architect Jungmin Choi of SK hynix talks about the need for memory bandwidth and capacity, and the SK hynix Niagara solution.
During the CXL Forum at OCP Global Summit, Mahesh Wagh, CXL Consortium TTF Co-chair and Senior Fellow at AMD, presented an update on the CXL Consortium mission and roadmap.
All Presentations during CXL Forum at Flash Memory Summit 22 - Memory Fabric Forum
The document summarizes a full-day forum hosted by the CXL Consortium and MemVerge on CXL. The morning agenda includes presentations on CXL from representatives of Google, Intel, PCI-SIG, Marvell, Samsung, and Micron. The afternoon agenda includes panels on CXL usage models from Meta, OCP, Anthropic, and MemVerge. A keynote presentation provides an update on the CXL Consortium and the recently released CXL 3.0 specification, including its expanded fabric capabilities and management features. The specification is aimed at enabling new usage models for memory sharing and expansion to address industry trends toward increased data processing demands.
During the CXL Forum at OCP Global Summit, MemVerge CEO Charles Fan presented accomplishments of the CXL industry since 2019, the development of concept cars occurring today, and his predictions for the future of CXL.
CXL Memory Expansion, Pooling, Sharing, FAM Enablement, and Switching - Memory Fabric Forum
The document discusses CXL, a new open standard protocol for efficient CPU and memory connectivity. CXL allows for memory disaggregation and pooling across devices by enabling high-bandwidth, low-latency connections between CPUs, GPUs, accelerators, and memory. This helps address the growing CPU-memory bottleneck by allowing expansion of memory capacity beyond what can physically connect to the CPU. CXL also enables memory tiering by providing different performance and cost options for "near" directly attached memory versus "far" switched or fabric attached memory.
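As background for the near/far distinction above: on Linux, CXL Type 3 memory expanders typically appear as CPU-less NUMA nodes, which is the hook that tiering software builds on. The sketch below is not from the presentation; it simply scans the standard sysfs node directories and flags CPU-less nodes as likely CXL "far" memory, and that CPU-less-equals-CXL rule is a heuristic assumption.

```python
# Rough sketch (not from the presentation): list NUMA nodes and flag CPU-less
# ones, which is how CXL Type 3 memory expanders commonly surface on Linux.
# Assumes a standard Linux sysfs layout; "CPU-less == CXL" is a heuristic.
import glob
import os

def numa_nodes():
    for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        node = os.path.basename(node_dir)
        with open(os.path.join(node_dir, "cpulist")) as f:
            cpulist = f.read().strip()
        total_kb = 0
        with open(os.path.join(node_dir, "meminfo")) as f:
            for line in f:
                parts = line.split()
                # Lines look like: "Node 0 MemTotal:  263921664 kB"
                if len(parts) >= 4 and parts[2] == "MemTotal:":
                    total_kb = int(parts[3])
        kind = "near (CPU-attached DRAM)" if cpulist else "far (CPU-less, likely CXL)"
        yield node, kind, total_kb

if __name__ == "__main__":
    for node, kind, total_kb in numa_nodes():
        print(f"{node}: {kind}, {total_kb // 1024} MiB")
```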
During the CXL Forum at OCP Global Summit, Enfabrica CEO Rochan Sankar described how to bridge the network and memory worlds with their accelerated compute fabric switch.
During the CXL Forum at OCP Global Summit, SMART Modular Director of Product Marketing Arthur Sainio provides an overview of the company's CXL memory cards and modules.
In the CXL Forum Theater at SC23 hosted by MemVerge, the Open Compute Project provided an overview of CXL, as well as CXL-related hardware and software projects at OCP.
During the CXL Forum at OCP Global Summit, Dharmesh Jani of Meta and Siamak Tavallaei of the CXL Consortium describe the extensive work being done by the Open Compute Project related to CXL.
Shared Memory Centric Computing with CXL & OMI - Allan Cantle
Discusses how CXL can be better utilized as a fabric cache domain separate from a processor's own local cache domain, by leveraging a shared-memory-centric architecture that uses both the Open Memory Interface (OMI) and Compute Express Link (CXL) for the memory ports.
During the CXL Forum at OCP Global Summit 23, Rick Kutcipal and Sreeni Bagalkote of Broadcom presented their PCIe/CXL Roadmap and announced their Atlas 4 CXL switch.
Arm: Enabling CXL devices within the Data Center with Arm Solutions - Memory Fabric Forum
During the CXL Forum at OCP Summit, Arm Director of Segment Marketing Parag Beeraka provides an overview of the Arm portfolio of CXL products for the data center.
Molex and Nvidia - Partnership to enable copper for the next generation artif... - Memory Fabric Forum
During the CXL Forum at OCP Global Summit, Eddy Hwang of Nvidia and Wai Kong Poon of Molex presented a next-gen architecture for enabling copper for AI computing.
Lightelligence: Optical CXL Interconnect for Large Scale Memory Pooling - Memory Fabric Forum
During the CXL Forum at OCP Global Summit, Lightelligence Director of Engineering Ron Swartzentruber provides an overview of the company's optical port expander products and test results.
In this video from the 2017 Argonne Training Program on Extreme-Scale Computing, Pavan Balaji from Argonne presents an overview of system interconnects for HPC.
Watch the video: https://wp.me/p3RLHQ-hA4
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Vertex AI: Pipelines for your MLOps workflows - Márton Kodok
The document discusses Vertex AI pipelines for MLOps workflows. It begins with an introduction of the speaker and their background. It then discusses what MLOps is, defining three levels of automation maturity. Vertex AI is introduced as Google Cloud's managed ML platform. Pipelines are described as orchestrating the entire ML workflow through components. Custom components and conditionals allow flexibility. Pipelines improve reproducibility and sharing. Changes can trigger pipelines through services like Cloud Build, Eventarc, and Cloud Scheduler to continuously adapt models to new data.
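To make the pipeline idea concrete, here is a minimal sketch using the Kubeflow Pipelines (KFP) v2 SDK, which Vertex AI Pipelines consumes; the component and pipeline names are illustrative and not taken from the talk.

```python
# Minimal sketch of a Vertex AI-compatible pipeline using the KFP v2 SDK.
# Component and pipeline names are illustrative, not from the talk.
from kfp import compiler, dsl

@dsl.component
def preprocess(text: str) -> str:
    # Stand-in for a real preprocessing step.
    return text.strip().lower()

@dsl.component
def train(dataset: str) -> str:
    # Stand-in for a training step; returns a fake model id.
    return f"model-trained-on-{len(dataset)}-chars"

@dsl.pipeline(name="demo-mlops-pipeline")
def demo_pipeline(raw_text: str = "Hello Vertex AI"):
    prep = preprocess(text=raw_text)
    train(dataset=prep.output)

if __name__ == "__main__":
    # The compiled spec can be submitted with google.cloud.aiplatform.PipelineJob,
    # or triggered from Cloud Build, Eventarc, or Cloud Scheduler as described above.
    compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```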
PCI Express Verification using Reference Modeling - DVClub
This document discusses the modeling techniques used for complete verification of a PCI Express switch using reference modeling. It presents the use of Specman eRM for modeling the ingress port logic and router of the PCI Express switch at the block and chip level. The reference models are cycle-accurate and packet-accurate models that are independent of the device under test implementation. They are integrated to enable prediction and checking of runtime behavior at the chip level. Debug messages and coverage from the individual reference models are used to verify functional correctness.
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi... - Databricks
In this talk, we will present how we analyze, predict, and visualize network quality data as a Spark AI use case in a telecommunications company. SK Telecom is the largest wireless telecommunications provider in South Korea, with 300,000 cells and 27 million subscribers. These 300,000 cells generate data every 10 seconds, amounting to 60 TB, or 120 billion records, per day.
To address previous problems with Spark based on HDFS, we have developed a new data store for SparkSQL, consisting of Redis and RocksDB, that allows us to distribute and store this data in real time and analyze it right away. Not satisfied with analyzing network quality in real time, we also set out to predict network quality in the near future in order to quickly detect and recover from network device failures, by designing a network-signal-pattern-aware DNN model and a new in-memory data pipeline from Spark to TensorFlow.
In addition, by integrating Apache Livy and MapboxGL with SparkSQL and our new store, we have built a geospatial visualization system that shows the current population and signal strength of 300,000 cells on a map in real time.
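As a rough illustration of the kind of SparkSQL aggregation described above, here is a minimal PySpark sketch; the input path and column names (cell_id, rsrp, ts) are hypothetical, and the production system reads from the Redis/RocksDB store rather than from JSON files.

```python
# Illustrative PySpark sketch of per-cell signal-quality aggregation.
# Column names (cell_id, rsrp, ts) and the input path are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("network-quality").getOrCreate()

signals = spark.read.json("s3://example-bucket/cell-signals/")

per_cell = (
    signals
    .withColumn("minute", F.date_trunc("minute", F.col("ts").cast("timestamp")))
    .groupBy("cell_id", "minute")
    .agg(
        F.avg("rsrp").alias("avg_rsrp"),
        F.count("*").alias("samples"),
    )
)

# Cells whose average signal strength drops below a threshold would be
# candidates for the failure-prediction model mentioned above.
per_cell.filter(F.col("avg_rsrp") < -110).show()
```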
In this deck, Yuichiro Ajima from Fujitsu presents: The Tofu Interconnect D.
"Through the development of post-K, which will be equipped with this CPU, Fujitsu will contribute to the resolution of social and scientific issues in such computer simulation fields as cutting-edge research, health and longevity, disaster prevention and mitigation, energy, as well as manufacturing, while enhancing industrial competitiveness and contributing to the creation of Society 5.0 by promoting applications in big data and AI fields."
Learn more: https://insidehpc.com/2018/08/fujitsu-unveils-details-post-k-supercomputer-processor-powered-arm/
and
http://www.fujitsu.com/jp/solutions/business-technology/tc/catalog/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Linux Memory Management with CMA (Contiguous Memory Allocator) - Pankaj Suryawanshi
Fundamentals of Linux memory management and CMA (Contiguous Memory Allocator) in Linux.
Topics: virtual memory, physical memory, swap space, DMA, IOMMU, paging, segmentation, TLB, hugepages, and the ION memory manager (Google).
How AI and ML are driving Memory Architecture changes - Danny Sabour
Artificial intelligence and machine learning are fundamentally changing compute workloads in the cloud, the edge, and the IoT node. Memory architectures have changed to require persistence as a primary need over speed and low power. MRAM with its inherent persistence, low power and speed, is destined to become the next generation memory of choice all the way from the IoT node to the edge and the cloud.
The document discusses IBM's POWER9 processor and OpenPOWER ecosystem. It provides an overview of the POWER9 features such as its new core microarchitecture, enhanced cache hierarchy, and acceleration capabilities through technologies like NVLink 2.0 and CAPI 2.0. It also discusses the OpenCAPI open standard and IBM's efforts to build supercomputers for the US Department of Energy using POWER, NVIDIA GPUs, and Mellanox networking technologies.
VMworld 2015: The Future of Software-Defined Storage - What Does it Look Like... - VMworld
The document discusses the future of software-defined storage in 3 years. It predicts that storage media will continue to advance with higher capacities and lower latencies using technologies like 3D NAND and NVDIMMs. Networking and interconnects like NVMe over Fabrics will allow disaggregated storage resources to be pooled and shared across servers. Software-defined storage platforms will evolve to provide common services for distributed data platforms beyond just block storage, with advanced data placement and policy controls to optimize different workloads.
Everything is changing, from healthcare to the automotive market, and from financial markets to every kind of engineering: everything has stopped being created by an individual or, at best, a team, and is now being developed and perfected using AI and hundreds of computers. And even AI is something we can no longer run on a single computer, no matter how powerful it is. What drives everything today is HPC, or High-Performance Computing, heavily linked to AI. In this session we will discuss AI, HPC, the IBM Power architecture, and how it can help deliver better healthcare, better automobiles, better financials, and better everything else that we run on it.
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server - Rebekah Rodriguez
In this webinar, members of the Server Solution Team, along with a member of Supermicro’s Product Office, will discuss Supermicro’s Universal GPU Server; the server’s modular, standards-based design; the important role of the OCP Accelerator Module (OAM) form factor and Universal Baseboard (UBB) in the system; and AMD's next-generation HPC accelerator. In addition, we will get some insights into trends in the HPC and AI/machine learning space, including the different software platforms and best practices that are driving innovation in our industry and daily lives. In particular:
• Tools to enable use of the high-performance hardware for HPC and deep learning applications
• Tools to enable use of multiple GPUs, including RDMA, to solve highly demanding HPC and deep learning models, such as BERT
• Running applications in containers with AMD’s next-generation GPU system
Optimized HPC/AI cloud with OpenStack acceleration service and composable har... - Shuquan Huang
Today, data scientists are turning to the cloud for AI and HPC workloads. However, AI/HPC applications require high computational throughput, where generic cloud resources would not suffice. There is a strong demand for OpenStack to support hardware-accelerated devices in a dynamic model.
In this session, we will introduce OpenStack Acceleration Service – Cyborg, which provides a management framework for accelerator devices (e.g. FPGA, GPU, NVMe SSD). We will also discuss Rack Scale Design (RSD) technology and explain how physical hardware resources can be dynamically aggregated to meet AI/HPC requirements. The ability to “compose on the fly” with workload-optimized hardware and accelerator devices through an API allows data center managers to manage these resources in an efficient, automated manner.
We will also introduce an enhanced telemetry solution with Gnocchi, bandwidth discovery, and smart scheduling, leveraging RSD technology for efficient workload management in an HPC/AI cloud.
This document provides an overview of HPE solutions for challenges in AI and big data. It discusses HPE storage solutions including aggregated storage-in-compute using NVMe devices, tiered storage using flash, disk, and object storage, and zero watt storage to reduce power usage. It also covers the Scality object storage platform and WekaIO parallel file system for all-flash environments. The document aims to illustrate how HPE technologies can provide efficient, scalable storage for challenging AI and big data workloads.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/xilinx/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nick Ni, Director of Product Marketing at Xilinx, presents the "Xilinx AI Engine: High Performance with Future-proof Architecture Adaptability" tutorial at the May 2019 Embedded Vision Summit.
AI inference demands orders-of-magnitude more compute capacity than what today’s SoCs offer. At the same time, neural network topologies are changing too quickly to be addressed by ASICs that take years to go from architecture to production. In this talk, Ni introduces the Xilinx AI Engine, which complements the dynamically programmable FPGA fabric to enable ASIC-like performance via custom data flows and a flexible memory hierarchy. This combination provides an orders-of-magnitude boost in AI performance along with the hardware architecture flexibility needed to quickly adapt to rapidly evolving neural network topologies.
The state of Hive and Spark in the Cloud (July 2017) - Nicolas Poggi
Originally presented at the BDOOP and Spark Barcelona meetup groups: http://meetu.ps/3bwCTM
Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements on performance and the release of v2, making it challenging to keep up-to-date production services both on-premises and in the cloud for compatibility and stability. The talk compares:
• The performance of both v1 and v2 for Spark and Hive
• PaaS cloud services: Azure HDinsight, Amazon Web Services EMR, Google Cloud Dataproc
• Out-of-the-box support for Spark and Hive versions from providers
• PaaS reliability, scalability, and price-performance of the solutions
Using BigBench, the new Big Data benchmark standard. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal to stress Spark libraries (SparkSQL, DataFrames, MLlib, etc.).
Infrastructure optimization for seismic processing (eng) - Vsevolod Shabad
NetProject is a system integrator focused on building optimized IT infrastructure for seismic data processing applications. They aim to configure hardware and software to maximize application performance while minimizing costs. Key optimization strategies include choosing the right CPU, RAM, and server configurations; utilizing RDMA for efficient data transfer; offloading processing to GPUs; selecting high-performance file systems and storage; and optimizing resource scheduling and infrastructure management. NetProject leverages their expertise in oil and gas IT to help customers improve seismic processing performance.
A Dataflow Processing Chip for Training Deep Neural Networks - inside-BigData.com
In this deck from the Hot Chips conference, Chris Nicol from Wave Computing presents: A Dataflow Processing Chip for Training Deep Neural Networks.
Watch the video: https://wp.me/p3RLHQ-k6W
Learn more: https://wavecomp.ai/
and
http://www.hotchips.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document provides a summary of the IBM POWER9 AC922 system with 6 GPUs. It includes details on the POWER9 processor which features 24 cores per die, an enhanced cache hierarchy up to 120MB, and on-chip accelerators. The AC922 system utilizes two POWER9 processors, supports up to 512GB memory via 16 DDR4 DIMMs, and has three Nvidia Volta GPUs per socket connected via NVLink 2.0. It also discusses the POWER ISA v3.0 instruction set and how POWER9 serves as a premier acceleration platform with technologies like CAPI, OpenCAPI, and NVLink.
Ecosystem Alliance Manager Michael Ocampo talks about the CXL industry's effort to break through the memory wall, memory bound use cases, CXL for modular shared infrastructure, and critical CXL collaboration that's happening now.
Heterogeneous Computing: The Future of Systems - Anand Haridass
Charts from NITK-IBM Computer Systems Research Group (NCSRG)
- Dennard Scaling, Moore's Law, OpenPOWER, Storage Class Memory, FPGA, GPU, CAPI, OpenCAPI, NVIDIA NVLink, and Google/Microsoft heterogeneous system usage
The document summarizes several AI accelerators for cloud datacenters including Google TPU, HabanaLabs Gaudi, Graphcore IPU, and Baidu Kunlun. It discusses their architectures, performance, and how they address challenges in datacenters like workload diversity and energy efficiency. The accelerators use specialized hardware like systolic arrays and FPGA/ASIC designs to achieve much higher performance and efficiency than CPUs and GPUs for AI tasks like training deep learning models.
Similar to Q1 Memory Fabric Forum: Big Memory Computing for AI (20)
Q1 Memory Fabric Forum: ZeroPoint. Remove the waste. Release the power. - Memory Fabric Forum
Nilesh Shah provides an overview of the ZeroPoint portable, hardware IP portfolio for lossless memory compression and compaction. The IP boosts memory capacity 2-4x, improves bandwidth and performance/watt by 50%, and is 1,000x faster than competitors.
Q1 Memory Fabric Forum: Building Fast and Secure Chips with CXL IP - Memory Fabric Forum
Gary Ruggles, Sr. Product Manager for PCIe and CXL Controller IP, provides example use cases for adoption of CXL, an introduction to Synopsys CXL IP solutions, and interop proof points.
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx - Memory Fabric Forum
MemVerge product manager and software architect Steve Scargall discusses key factors related to the use of CXL with AI apps, including memory expansion form factors, latency- and bandwidth-aware memory placement strategies, RDBMS investigation and results, vector database investigation and results, and understanding your application behavior.
Q1 Memory Fabric Forum: Memory expansion with CXL-Ready Systems and Devices - Memory Fabric Forum
Ravi Gummaluri, Director, CXL System Architecture at Micron describes use cases for memory expansion with tiered DRAM and CXL memory, along with performance data.
Q1 Memory Fabric Forum: CXL-Related Activities within OCP - Memory Fabric Forum
OCP steering committee member, and former President of the CXL Consortium, Siamak Tavallaei, provides an overview of CXL-related activities happening within the Open Compute Project.
Q1 Memory Fabric Forum: CXL Controller by Montage Technology - Memory Fabric Forum
For CXL AIC and memory module designers, Nilesh Shah of Montage provides an overview of their CXL memory controller product, technology, and performance.
Nick Kriczsky and Gorden Getty provide an overview of Teledyne LeCroy’s Austin Labs portfolio of products and services, including: 1) testing for protocol and electrical compliance, interoperability, data integrity, and performance; 2) in-depth protocol training (PCIe, USB, NVMe, NVMe-oF, Fibre Channel); and 3) automation (solutions for analysis, jamming, and generation).
Torry Steed, Sr. Staff Product Manager at SMART Modular, covers the changing shape of memory leading to new categories of CXL form factors. He dives deeper to address EDSFF and AIC variations, mechanical sizes, installation locations, capacity considerations, and power ratings.
Q1 Memory Fabric Forum: Memory Fabric in a Composable System - Memory Fabric Forum
Eddie McMorrow, Sr. Product Manager at GigaIO, defines composable infrastructure and memory fabrics, then provides an overview of the FabreX memory fabric.
Q1 Memory Fabric Forum: Micron CXL-Compatible Memory Modules - Memory Fabric Forum
Michael Abraham, Director of Product Management at Micron, discusses data center challenges, the memory and storage hierarchy, Micron CZ120 memory modules, database (TPC-H) improvements, AI inferencing improvements, and how to enable them in your company.
Q1 Memory Fabric Forum: Advantages of Optical CXL for Disaggregated Compute ... - Memory Fabric Forum
Ron Swartzentruber, Director of Engineering at Lightelligence, explains why optical connectivity is needed for CXL fabrics, and provides an overview of the Photowave line of port expander PCIe cards and active optical cables.
Arvind Jagannath of VMware makes the case for bridging the CPU-Memory imbalance with memory tiering, describes their vision for memory disaggregation, and explains that VMware will support CXL Expanders – Specific Configurations, Memory Tiering to reduce overall TCO, and Memory Accelerators to enable CXL-based use-cases.
MemVerge Field CTO Yong Tian shows what memory expansion costs with an analysis of various server configurations with up to 8TB of tiered DRAM and CXL memory.
In the CXL Forum Theater at SC23 hosted by MemVerge, Lightelligence describes CXL's need for optical connectivity and their portfolio of CXL optical expander cards and cables.
Synopsys: Achieve First Pass Silicon Success with Synopsys CXL IP Solutions - Memory Fabric Forum
This document discusses Synopsys' CXL IP solutions for enabling first pass silicon success. It provides an overview of:
- How large data sets are driving the need for CXL and larger, more efficient cache coherent storage.
- How CXL allows memory expansion by enabling one interface to connect to various memory types like DDR, LPDDR, and persistent memory.
- Synopsys' complete CXL IP solution which uses proven PCIe IP to provide a highly efficient 512-bit controller and 32GT/s PHY for maximum bandwidth and low latency.
- Synopsys' work with XConn to achieve first pass silicon success on a 256-lane CXL 2.0 switch SoC.
In the CXL Forum Theater at SC23 hosted by MemVerge, Samsung described the architecture and use cases of their hybrid drive that includes DRAM and flash memory.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! - SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf - Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
20 Comprehensive Checklist of Designing and Developing a Website - Pixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
How to Get CNIC Information System with Paksim Ga.pptx - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
3. The Emergence of the AI Computer
• x86 Era: Compute (x86 CPUs, DDR Memory), Data (IP Storage), Connectivity (IP Networks)
• AI Era: Compute (GPUs, AI Processors, HBM), Data (Fabric-Attached Memory), Connectivity (AI Fabric: NVLink, CXL)
8. CXL Expansion vs. DDR DRAM Performance
[Chart: DRAM vs. CXL, consolidated latency (ns, 0-1,400) vs. throughput (GB/s, 0-250); series: 1 DIMM DDR5 4800, 4 DIMM DDR5 4800, 16 DIMM DDR5 4800, CXL Gen5 x8]
9. Intelligent Memory Placement Engine
• The QoS Policy Engine supports multiple policies to maximize bandwidth and minimize latency
• The auto-tiering policy exhibits superior performance to hardware interleaving or OS kernel tiering; a toy sketch of the idea follows
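The toy sketch below illustrates the basic idea behind hotness-based auto-tiering: keep the most frequently accessed pages in DRAM and demote the rest to the CXL tier. It is not MemVerge's placement engine, and it assumes per-page access counts are already available from some profiling source.

```python
# Toy illustration of hotness-based tiering (not MemVerge's actual engine):
# keep the most frequently accessed pages in DRAM, demote the rest to CXL.
def place_pages(access_counts: dict[int, int], dram_capacity_pages: int):
    """Return (dram_pages, cxl_pages) given {page_id: access_count}."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    dram = set(ranked[:dram_capacity_pages])   # hottest pages stay near the CPU
    cxl = set(ranked[dram_capacity_pages:])    # colder pages go to the CXL tier
    return dram, cxl

if __name__ == "__main__":
    counts = {0: 900, 1: 5, 2: 340, 3: 12, 4: 770}
    dram, cxl = place_pages(counts, dram_capacity_pages=2)
    print("DRAM:", sorted(dram), "CXL:", sorted(cxl))  # DRAM: [0, 4] CXL: [1, 2, 3]
```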
13. Memory Machine™
Available now! Contact mike.hoey@memverge.com to request a PoC.
• Memory Machine™ X: Server Memory Expansion (auto-tiering)
• Big Memory Appliance
Learn more on February 9 at 12:30 PT: "Using CXL with AI Applications" with Steve Scargall, MemVerge. Register to attend.
16. Transmitting and Sharing Data between Processes
• Single node: processes share data through sockets, queues, and pipes in DRAM, or through files.
• Multi-node, message passing: processes on different nodes exchange data over TCP/IP/Ethernet with MPI/NCCL.
• Multi-node, shared storage: processes on different nodes share data through networked storage.
• Multi-node, shared memory: processes on different nodes share data through CXL memory.
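For the single-node shared-memory case above, here is a minimal sketch using Python's standard multiprocessing.shared_memory module: two processes touch the same buffer with no serialization and no copy through a pipe or socket. It is only a single-node analogy; extending the same model across nodes is what CXL shared memory adds.

```python
# Single-node analogy for the shared-memory path: two processes use one buffer.
# CXL shared memory extends this idea across nodes (this sketch does not).
from multiprocessing import Process, shared_memory

def writer(name: str):
    shm = shared_memory.SharedMemory(name=name)  # attach to the existing segment
    shm.buf[:5] = b"hello"                       # write directly into shared memory
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=1024)
    p = Process(target=writer, args=(shm.name,))
    p.start()
    p.join()
    print(bytes(shm.buf[:5]))  # b'hello' -- no serialization, no copy over a pipe
    shm.close()
    shm.unlink()
```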
17. When & Why is Shared Memory Preferred
17
• Shared Memory Vs. Message Passing:
• Benefit: Take out the networking
• Cost: Cache coherence & synchronization
• Sweet Spot: when the R:W ratio is high
• CRUD -> CRAP
• Not general purpose SHM, so not fit for the kernel
Node
Proc.
Node
Proc.
CXL Memory
Node
Proc.
Node
Proc.
Local
Memory
Local
Memory
• Other potential considerations:
o 1 to N communication
o Data not easily shardable
o Saving memory capacity cost
18. When & Why is Shared Memory Preferred
18
• Vs. Shared Storage:
• Sweet Spot: When the performance requirement is high
• When the data does not need to be persisted permanently
Node
Proc.
Node
Proc.
Node
Proc.
Node
Proc.
Node
Proc.
Node
Proc.
Networked Storage CXL Memory
19. Introducing Gismo Software
Global I/O-free Shared Memory Objects
[Diagram: each node runs an App on the Gismo Library over its CPU and local DDR DRAM (NUMA 0); a Gismo Manager runs on one node, and all nodes attach to CXL Shared Memory]
20. Use Case 1: AI/ML
• Baseline Ray: each node runs a Raylet and Workers with an Object Store in local memory; objects are serialized and copied over the network between nodes.
• Ray + Gismo: each node runs a Raylet and Workers with Gismo; objects live in CXL shared memory accessible from every node, alongside each node's local memory.
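The baseline path can be sketched with Ray's object-store API (ray.put/ray.get); per the slides, Gismo backs that object store with CXL shared memory, so a remote get no longer serializes and copies the object over the network. The NumPy array below is just a stand-in payload.

```python
# Minimal Ray object-store sketch matching the "Baseline Ray" path above.
# With Gismo (per the slides), the store is backed by CXL shared memory, so a
# remote get avoids the serialize-and-copy-over-the-network step.
import numpy as np
import ray

ray.init()  # single-node demo; a cluster would use ray.init(address="auto")

@ray.remote
def checksum(arr: np.ndarray) -> float:
    # In baseline Ray, a worker on another node must fetch a copy of `arr`.
    return float(arr.sum())

obj = np.ones((1024, 1024), dtype=np.float32)  # stand-in for a ~4 MB object
ref = ray.put(obj)                              # place the object in the object store
print(ray.get(checksum.remote(ref)))            # 1048576.0
```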
21. Shuffle Benchmark Results*
• Local Get 1 GB object: Baseline Ray 0.4 sec; with Gismo 0.4 sec (CXL shared memory as fast as local memory)
• Remote Get 1 GB object: Baseline Ray 2.7 sec; with Gismo 0.4 sec (675% faster)
• Shuffle 50 GB (4 nodes, each with 4 cores and a 128 GB object store): Baseline Ray 515 sec; with Gismo 185 sec (280% faster)
* Running in emulation environment
22. Benefits of Ray + Gismo
• IO-free: eliminates object serialization and transfers over the network for remote object access
• Zero copy: no more duplicate object copies on different nodes
• No spilling: reduces object spilling and data skewing, with each node accessing the memory pool
23. Memory Machine X Sharing
Alpha availability on the Q1’24 - Q4’24 roadmap. Please contact cxl@memverge.com if interested.
24. Introducing MemVerge
Award-winning software for Big Memory Computing
• Memory Machine™ X: fabric-attached CXL memory for AI
• Memory Machine™ Cloud: hybrid cloud compute platform
Select Customers