How Coupang Leverages Distributed Cache to Accelerate ML Model Training

April 22, 2025
Hyun Jung Baek, Staff Backend Engineer @ Coupang

Coupang is a Technology and Fortune 200 Company (NYSE: CPNG)
Coupang is a technology and Fortune 200 company listed on the New York Stock Exchange (NYSE: CPNG) that provides retail, restaurant delivery, video streaming, and fintech services to customers around the world under brands that include Coupang, Coupang Eats, Coupang Play and Farfetch.

Machine Learning Impacts Every Aspect of Commerce Experiences of Coupang Customers
● Product Catalog
● Search
● Pricing
● Robotics
● Inventory
● Fulfilment
Meet Coupang’s Machine Learning Platform: https://medium.com/coupang-engineering/meet-coupangs-machine-learning-platform-cd00e9ccc172

Coupang’s ML Platform Overview
Core offerings
● Notebooks & ML Pipeline Authoring
● Model Training
● Model Inference
● Monitoring & Observability

Hybrid & Multi-Region Compute & Storage Due to GPU Shortage
Both AWS Multi-Region & On-prem GPU Clusters
● Cloud GPU clusters across AWS Asia-Pacific & US regions
● On-prem data center (compute & storage)
Requirements
● Resource efficiency
○ GPU utilization
● High I/O throughput
● Developer experience
● Cloud cost optimization
[Figure: Monitoring GPU utilization of Training cluster]

Previous Architecture
[Diagram: data is copied ("Data Copy") from the Data Lake (ap-region) into the Local Storage used by the GPU Training Clusters (ap-region, us-region, On Prem)]

Challenges of the Previous Architecture
● Required a preparation step (copy and validation) before training jobs
○ Added a day-long delay before training on a dataset
● Challenges in fully utilizing GPU resources across regions
○ Difficult to run overflow training jobs in a different region when the local cluster is at peak load, as the data may not be available there or may exist under different paths
● Data silos and growing storage cost
● Operational overhead to keep storage organized and under capacity
○ Required coordination across teams to manage and maintain local storage

New Architecture with Distributed Cache
[Diagram: a Distributed Cache sits between the GPU clusters (ap-region, us-region, On Prem) and the Data Lake (ap-region); the Data Lake is read only on cache miss]
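
The "only on cache miss" behavior in the diagram can be illustrated with a minimal read-through cache sketch. This is not Coupang's implementation: the class name, the `fetch` callback, and the in-memory `DATA_LAKE` dictionary are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, List


class ReadThroughCache:
    """Toy read-through cache: hits are served locally, misses go to the backing store."""

    def __init__(self, fetch_from_data_lake: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_data_lake  # hypothetical data lake reader
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._store:
            self.hits += 1
            return self._store[path]
        # Cache miss: the only case in which the data lake is contacted.
        self.misses += 1
        data = self._fetch(path)
        self._store[path] = data
        return data


# In-memory stand-in for the data lake (hypothetical path and contents).
DATA_LAKE: Dict[str, bytes] = {"s3://example-data-lake/train/part-0": b"example-bytes"}
lake_reads: List[str] = []


def fetch(path: str) -> bytes:
    lake_reads.append(path)  # record every trip to the data lake
    return DATA_LAKE[path]


cache = ReadThroughCache(fetch)
first = cache.read("s3://example-data-lake/train/part-0")   # miss: fetched from the data lake
second = cache.read("s3://example-data-lake/train/part-0")  # hit: served from the cache
```

In this sketch the second read never reaches the data lake, which is the property that removes the up-front copy step of the previous architecture.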

Inside Distributed Cache
[Diagram, components and flow:]
● Training Job Pod: sends I/O requests through hostpath /mnt/cache-fuse
● FUSE Pods: serve the mounted path to training jobs
● Distributed Cache Service: Worker Pods and etcd Pods
● etcd Pods: hold the mount table & membership
● Data Lake: accessed on cache miss
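
One plausible reading of the mount table is a mapping from cache namespace prefixes to data lake addresses, so that a file opened under the FUSE hostpath can be traced back to its origin. The sketch below assumes that reading; the mount points, bucket names, and the `resolve` helper are hypothetical, and only `/mnt/cache-fuse` comes from the slide.

```python
from typing import Dict

FUSE_ROOT = "/mnt/cache-fuse"  # hostpath shown on the slide

# Hypothetical mount table: cache namespace prefix -> data lake address.
MOUNT_TABLE: Dict[str, str] = {
    "/datasets": "s3://example-data-lake/datasets",
    "/datasets/images": "s3://example-image-lake/images",
}


def resolve(local_path: str, mount_table: Dict[str, str] = MOUNT_TABLE) -> str:
    """Translate a path under the FUSE mount into its data lake address."""
    if local_path != FUSE_ROOT and not local_path.startswith(FUSE_ROOT + "/"):
        raise ValueError(f"{local_path} is not under {FUSE_ROOT}")
    cache_path = local_path[len(FUSE_ROOT):] or "/"
    # Longest matching mount point wins, so nested mounts override their parents.
    for mount_point in sorted(mount_table, key=len, reverse=True):
        if cache_path == mount_point or cache_path.startswith(mount_point + "/"):
            return mount_table[mount_point] + cache_path[len(mount_point):]
    raise KeyError(f"no mount point covers {cache_path}")


parquet_source = resolve("/mnt/cache-fuse/datasets/train/a.parquet")
image_source = resolve("/mnt/cache-fuse/datasets/images/1.jpg")
```

Because the same table would be shared by every cluster, a training job sees identical paths regardless of the region it runs in.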

New Architecture: Benefits for Model Developers
● Instant Data Availability
○ Eliminates lengthy data preparation
■ Training jobs can start immediately without waiting for data to be cached
○ Model developers can still pre-load datasets using the --skip-if-exists flag
■ If already cached, this step is a no-op
○ No coordination required across teams, simplifying the workflow
● Improved GPU Utilization Across Regions
○ Maintains a consistent view of all data paths, derived from the original data lake address, enabling seamless access across regions
○ During peak GPU hours, developers can submit unmodified training jobs to an overflow GPU cluster, raising GPU utilization across regions
● Faster Training Jobs
○ Provides higher performance than traditional HPC storage solutions (e.g., AWS FSx), significantly reducing training time and boosting productivity
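
The pre-load behavior described for the --skip-if-exists flag (a no-op when the data is already cached) can be sketched as follows. The `preload` function and the dictionaries standing in for the cache and the data lake are hypothetical; this models the semantics only and is not the actual load command.

```python
from typing import Dict, Iterable, List


def preload(cache: Dict[str, bytes], lake: Dict[str, bytes],
            paths: Iterable[str], skip_if_exists: bool = True) -> List[str]:
    """Copy paths from the lake into the cache; return the paths actually copied."""
    loaded: List[str] = []
    for path in paths:
        if skip_if_exists and path in cache:
            continue  # already cached: nothing to do for this path
        cache[path] = lake[path]
        loaded.append(path)
    return loaded


# Hypothetical dataset paths and contents.
lake = {"/datasets/a": b"A", "/datasets/b": b"B"}
cache: Dict[str, bytes] = {}

first_run = preload(cache, lake, list(lake))   # copies both files
second_run = preload(cache, lake, list(lake))  # copies nothing
```

Running the pre-load twice is therefore safe, so it can be placed in front of every training job without coordinating with other teams.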

New Architecture: Benefits for Platform Engineers
● Reduced Storage Costs & Operational Overhead
○ Avoids purchasing storage for the full dataset size by eliminating duplicate copies of data lake datasets
■ Data lake (many PBs) vs cache capacity (TB to PB)
○ No coordination required for cache space cleanup
● Easy Expansion & Operation
○ Seamlessly scales the architecture to new GPU clusters without complex reconfiguration
○ Fully managed with Kubernetes (K8s) for simplified deployment, scaling, and maintenance across environments
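
A cache that is much smaller than the data lake and needs no manual cleanup has to evict data on its own. The slides do not name the eviction policy; the sketch below assumes least-recently-used (LRU) eviction purely for illustration, with a hypothetical `LRUCache` class and byte sizes.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Toy capacity-bounded cache that evicts the least recently used entry."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def __contains__(self, path: str) -> bool:
        return path in self._entries

    def get(self, path: str) -> Optional[bytes]:
        if path not in self._entries:
            return None
        self._entries.move_to_end(path)  # mark as most recently used
        return self._entries[path]

    def put(self, path: str, data: bytes) -> None:
        if len(data) > self.capacity_bytes:
            raise ValueError("object larger than total cache capacity")
        if path in self._entries:
            self.used_bytes -= len(self._entries.pop(path))
        # Evict from the least recently used end until the new object fits.
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._entries[path] = data
        self.used_bytes += len(data)


lru = LRUCache(capacity_bytes=10)
lru.put("a", b"1234")
lru.put("b", b"5678")
lru.get("a")           # "a" becomes the most recently used entry
lru.put("c", b"9012")  # does not fit, so "b" is evicted automatically
```

Cold datasets fall out of the cache without anyone deleting them, and they remain available from the data lake on the next miss.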

THANK YOU
Copyright © 2024 Coupang, Inc. All rights reserved. All Coupang trademarks, Coupang logos and service marks displayed herein are property of Coupang, Inc. and/or its affiliates (collectively, "Coupang"),
registered in the U.S. and other countries. Any other company mentioned herein is merely for identification purposes. Coupang acknowledges that the company name may be a registered trademark of
the company and recognizes that any such trademark is owned solely and exclusively by such company. The information contained herein are based on the author, Hyun Jung Baek's own individual
experience as an employee and are not representative of any views or opinions of Coupang. Coupang has not verified, and it makes no representation as to, the adequacy, fairness, accuracy, or
completeness of any information contained herein.
