How Coupang Leverages Distributed Cache to Accelerate ML Model Training
April 22, 2025
Hyun Jung Baek, Staff Backend Engineer @ Coupang
Coupang Confidential and Proprietary
Coupang is a Technology and Fortune 200 Company (NYSE: CPNG)
Coupang is a technology and Fortune 200 company listed on the New York Stock Exchange (NYSE: CPNG) that provides retail, restaurant delivery, video streaming, and fintech services to customers around the world under brands that include Coupang, Coupang Eats, Coupang Play, and Farfetch.
Machine Learning Impacts Every Aspect of Commerce Experiences of Coupang Customers
● Product Catalog
● Search
● Pricing
● Robotics
● Inventory
● Fulfilment
Meet Coupang’s Machine Learning Platform: https://medium.com/coupang-engineering/meet-coupangs-machine-learning-platform-cd00e9ccc172
Coupang’s ML Platform Overview
Core offerings
● Notebooks & ML Pipeline Authoring
● Model Training
● Model Inference
● Monitoring & Observability
Hybrid & Multi-Region Compute & Storage Due to GPU Shortage
Both AWS Multi-Region & On-prem GPU Clusters
● Cloud GPU clusters across AWS Asia-Pacific & US regions
● On-prem data center (compute & storage)
Requirements
● Resource efficiency
○ GPU utilization
● High I/O throughput
● Developer experience
● Cloud cost optimization
Monitoring GPU utilization of Training cluster
Previous Architecture
(Diagram: datasets are copied ("Data Copy") from the Data Lake into the Local Storage attached to each GPU Training Cluster. Labeled locations: ap-region, On Prem, us-region.)
Challenges of the Previous Architecture
● Required preparation step (copy and validation) before training jobs
○ Added a day-long delay before training could start on a dataset
● Challenges in fully utilizing GPU resources across regions
○ Difficult to run overflow training jobs in a different region when the local cluster is at peak load, as the data may not be available there or may exist under different paths
● Growing data silos and storage costs
● Operational overhead to keep storage organized and under capacity
○ Required coordination across teams to manage and maintain local storage
New Architecture with Distributed Cache
(Diagram: each GPU Training Cluster reads through a Distributed Cache, which fetches from the Data Lake only on a cache miss. Labeled locations: ap-region, On Prem, us-region.)
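The "Only on Cache Miss" path in this diagram matches the general read-through cache pattern. The following is a minimal Python sketch of that pattern, not Coupang's implementation: the class `ReadThroughCache`, the function `fake_data_lake`, and the `s3://example-bucket/...` path are hypothetical, and an in-memory dict stands in for the distributed cache.

```python
from typing import Callable, Dict, List


class ReadThroughCache:
    """Toy read-through cache keyed by data lake path (illustrative only)."""

    def __init__(self, fetch_from_data_lake: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_data_lake
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        # Cache hit: serve locally, no data lake traffic.
        if path in self._store:
            self.hits += 1
            return self._store[path]
        # Cache miss: fetch from the data lake once, then keep a copy.
        self.misses += 1
        data = self._fetch(path)
        self._store[path] = data
        return data


# Hypothetical stand-in for the data lake; records every remote read.
lake_calls: List[str] = []


def fake_data_lake(path: str) -> bytes:
    lake_calls.append(path)
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_data_lake)
first = cache.read("s3://example-bucket/train/part-0000")   # miss
second = cache.read("s3://example-bucket/train/part-0000")  # hit
```

In this sketch the second read never reaches the data lake, which is the property that lets a training job start without a separate copy step.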
Inside Distributed Cache
(Diagram. Components: Training Job Pod with hostPath /mnt/cache-fuse, FUSE Pods, Worker Pods, etcd Pods, Distributed Cache Service, Data Lake. Labeled flows: I/O Request, Mount Table & Membership, Cache Miss.)
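To make the role of the mount point concrete, here is a small Python sketch of how a data lake address could be translated into a file path under the FUSE mount. Only the `/mnt/cache-fuse` mount point comes from the slide; the `MOUNT_TABLE` contents, the bucket name, and the `to_local_path` helper are assumptions made for illustration, and the real mount table layout may differ.

```python
from pathlib import PurePosixPath

# Mount point exposed to the training job pod (from the slide).
FUSE_ROOT = PurePosixPath("/mnt/cache-fuse")

# Hypothetical mount table: directory under the FUSE root -> data lake prefix.
MOUNT_TABLE = {
    "datasets": "s3://example-data-lake/datasets",
}


def to_local_path(uri: str) -> str:
    """Translate a data lake URI into a path under the FUSE mount."""
    for mount_name, prefix in MOUNT_TABLE.items():
        if uri == prefix or uri.startswith(prefix + "/"):
            relative = uri[len(prefix):].lstrip("/")
            return str(FUSE_ROOT / mount_name / relative)
    raise ValueError(f"no mount covers {uri!r}")


local = to_local_path("s3://example-data-lake/datasets/search/part-0000.parquet")
```

Because the mapping depends only on the data lake address, the same training code would resolve to the same local path in any cluster that mounts the cache the same way.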
New Architecture: Benefits for Model Developers
● Instant Data Availability
○ Eliminates lengthy data preparation
■ Training jobs can start immediately without waiting for data to be cached
○ Model developers can still pre-load datasets using the --skip-if-exists flag
■ If the data is already cached, this step is a no-op
○ No coordination required across teams, simplifying the workflow
● Improved GPU Utilization Across Regions
○ Maintains a consistent view of all data paths based on the original data lake address, enabling seamless access across regions
○ During peak GPU hours, developers can submit unmodified training jobs to an overflow GPU cluster, raising GPU utilization across regions
● Faster Training Jobs
○ Provides higher performance than traditional HPC storage solutions (e.g., AWS FSx), significantly reducing training time and boosting productivity
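The pre-load behavior described on this slide can be modeled as an idempotent operation. The sketch below is a hypothetical Python model of the `--skip-if-exists` semantics, not the actual command: the `preload` function, its parameters, and the example paths are assumptions, and a plain dict stands in for the cache.

```python
from typing import Callable, Dict, Iterable, List


def preload(
    paths: Iterable[str],
    cache: Dict[str, bytes],
    fetch: Callable[[str], bytes],
    skip_if_exists: bool = True,
) -> List[str]:
    """Load paths into the cache and return the ones actually fetched."""
    loaded: List[str] = []
    for path in paths:
        # With skip_if_exists, already-cached paths are left untouched.
        if skip_if_exists and path in cache:
            continue
        cache[path] = fetch(path)
        loaded.append(path)
    return loaded


def fake_fetch(path: str) -> bytes:
    # Hypothetical stand-in for a data lake read.
    return path.encode()


dataset = ["lake/train/a", "lake/train/b"]
cache_store: Dict[str, bytes] = {}
first_run = preload(dataset, cache_store, fake_fetch)
second_run = preload(dataset, cache_store, fake_fetch)  # nothing left to load
```

Running the pre-load twice fetches each path only once, so it is safe to include in every pipeline run.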
New Architecture: Benefits for Platform Engineers
● Reduced Storage Costs & Operational Overhead
○ Avoids buying full-capacity storage by eliminating duplicate copies of data lake datasets
■ Data lake (many PBs) vs. cache capacity (TB to PB)
○ No coordination required for cache space cleanup
● Easy Expansion & Operation
○ Scales to new GPU clusters without complex reconfiguration
○ Fully managed with Kubernetes (K8s) for simplified deployment, scaling, and maintenance across environments
THANK YOU
Copyright © 2024 Coupang, Inc. All rights reserved. All Coupang trademarks, Coupang logos and service marks displayed herein are property of Coupang, Inc. and/or its affiliates (collectively, "Coupang"),
registered in the U.S. and other countries. Any other company mentioned herein is merely for identification purposes. Coupang acknowledges that the company name may be a registered trademark of
the company and recognizes that any such trademark is owned solely and exclusively by such company. The information contained herein are based on the author, Hyun Jung Baek's own individual
experience as an employee and are not representative of any views or opinions of Coupang. Coupang has not verified, and it makes no representation as to, the adequacy, fairness, accuracy, or
completeness of any information contained herein.