How Coupang Leverages Distributed Cache to Accelerate ML Model Training

Alluxio 952 views 13 slides Apr 22, 2025

About This Presentation

Alluxio Tech Talk Webinar
Apr. 22, 2025
Organized by Alluxio

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Hyun Jung Baek (Staff Backend Engineer @ Coupang)

Description
Coupang is a leading e-commerce company in South Korea, with over 50,000 employees and $20+ billion in ann...


Slide Content

How Coupang Leverages Distributed Cache to Accelerate ML Model Training
April 22, 2025
Hyun Jung Baek, Staff Backend Engineer @ Coupang

Coupang Confidential and Proprietary
Coupang is a technology and Fortune 200 company listed on
the New York Stock Exchange (NYSE: CPNG) that provides
retail, restaurant delivery, video streaming, and fintech services
to customers around the world under brands that include
Coupang, Coupang Eats, Coupang Play and Farfetch.
Coupang is a Technology and
Fortune 200 Company (NYSE: CPNG)

Machine Learning Impacts Every Aspect of Commerce
Experiences of Coupang Customers
Product Catalog, Search, Pricing
Robotics, Inventory, Fulfilment
Meet Coupang’s Machine Learning Platform: https://medium.com/coupang-engineering/meet-coupangs-machine-learning-platform-cd00e9ccc172

Core offerings
• Notebooks & ML Pipeline Authoring
• Model Training
• Model Inference
• Monitoring & Observability
Coupang’s ML Platform Overview

Both AWS Multi-Region & On-prem GPU Clusters
● Cloud GPU clusters across AWS Asia-Pacific & US regions
● On-prem data center (compute & storage)
Requirements
● Resource efficiency
○ GPU utilization
● High I/O throughput
● Developer experience
● Cloud cost optimization
Hybrid & Multi-Region Compute & Storage Due to GPU Shortage
Monitoring GPU utilization of Training cluster

Previous Architecture
[Diagram: datasets are copied ("Data Copy") from the Data Lake into the Local Storage of each GPU Training Cluster before training. Locations labeled in the diagram: ap-region, On Prem, us-region.]

● Required a preparation step (copy and validation) before training jobs
○ Added a day-long delay before training on a dataset
● Challenges in fully utilizing GPU resources across regions
○ Difficult to run overflow training jobs in a different region when the local cluster is at peak, as the data may not be available or may exist under different paths
● Data silos and growing storage cost
● Operational overhead to keep storage organized and under capacity
○ Required coordination across teams to manage and maintain local storage
Challenges of the Previous Architecture

New Architecture with Distributed Cache
[Diagram: each GPU Training Cluster reads through a Distributed Cache, which goes to the Data Lake only on a cache miss. Locations labeled in the diagram: ap-region, On Prem, us-region.]
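
The "Only on Cache Miss" path can be pictured as a read-through cache. The following is a minimal sketch for illustration, not the cache product's implementation: the class name `ReadThroughCache`, the in-memory dictionary, and the `s3://example-lake/...` address are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List


class ReadThroughCache:
    """Toy in-memory stand-in for a regional distributed cache (hypothetical)."""

    def __init__(self, fetch_from_data_lake: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_data_lake
        self._store: Dict[str, bytes] = {}
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path not in self._store:
            # Cache miss: the only case in which the remote data lake is contacted.
            self.misses += 1
            self._store[path] = self._fetch(path)
        return self._store[path]


# Hypothetical data lake contents, keyed by the original data lake address.
DATA_LAKE = {"s3://example-lake/train/part-0": b"rows-0"}
lake_reads: List[str] = []


def fetch(path: str) -> bytes:
    lake_reads.append(path)  # record every trip to the data lake
    return DATA_LAKE[path]


cache = ReadThroughCache(fetch)
first = cache.read("s3://example-lake/train/part-0")   # miss, fetched from the lake
second = cache.read("s3://example-lake/train/part-0")  # hit, served from the cache
```

In this sketch the second read never reaches the data lake, which is why repeated epochs over the same dataset would not pay the cross-region cost again.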

Inside Distributed Cache
[Diagram: Training Job Pods send I/O requests through a hostpath mount (/mnt/cache-fuse) backed by FUSE Pods. The Distributed Cache Service consists of Worker Pods and etcd Pods (Mount Table & Membership). On a cache miss, data is read from the Data Lake.]
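
From the training job's point of view, the cache looks like an ordinary directory under the hostpath mount. The sketch below shows one way a data lake address could be translated into a path under `/mnt/cache-fuse`. The mount point is taken from the slide, but the `<mount>/<bucket>/<key>` layout, the helper name `to_fuse_path`, and the example address are assumptions; the real layout is determined by the cache's mount table.

```python
import posixpath
from urllib.parse import urlparse

FUSE_MOUNT = "/mnt/cache-fuse"  # hostpath shown on the slide


def to_fuse_path(data_lake_uri: str, mount: str = FUSE_MOUNT) -> str:
    """Map a data lake URI to a local path under the FUSE mount.

    Assumed layout (hypothetical): <mount>/<bucket>/<key>.
    """
    parsed = urlparse(data_lake_uri)
    if not parsed.netloc:
        raise ValueError(f"expected a URI like scheme://bucket/key, got {data_lake_uri!r}")
    return posixpath.join(mount, parsed.netloc, parsed.path.lstrip("/"))


local_path = to_fuse_path("s3://example-lake/train/part-0")
# A training job would then read it with plain file I/O, e.g. open(local_path, "rb").
```

Because the mapping depends only on the data lake address, the same job specification would resolve to the same path in any cluster that exposes the mount.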

● Instant Data Availability
○ Eliminates lengthy data preparation
■ Training jobs can start immediately without waiting for data to be cached
○ Model developers can still pre-load datasets using the --skip-if-exists flag
■ If already cached, this step is a no-op
○ No coordination required across teams, simplifying the workflow
● Improved GPU Utilization Across Multiple Regions
○ Maintains a consistent view of all data paths from the original data lake address, enabling seamless access across regions
○ During peak GPU hours, developers can submit training jobs, unmodified, to an overflow GPU cluster, ensuring higher GPU utilization across multiple regions
● Faster Training Jobs
○ Provides higher performance compared to traditional HPC storage solutions (e.g., AWS FSx), significantly reducing training time and boosting productivity
New Architecture: Benefits for Model Developers
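
The no-op behavior of pre-loading with --skip-if-exists can be illustrated with a small idempotent loader. This is a sketch of the semantics only: the function `preload`, its parameters, and the example paths are hypothetical and do not reproduce the actual command the flag belongs to.

```python
from typing import Callable, Dict, Iterable, List


def preload(
    cache: Dict[str, bytes],
    paths: Iterable[str],
    fetch: Callable[[str], bytes],
    skip_if_exists: bool = True,
) -> List[str]:
    """Load paths into the cache and return the ones that were actually fetched."""
    loaded: List[str] = []
    for path in paths:
        if skip_if_exists and path in cache:
            continue  # already cached: nothing to do for this path
        cache[path] = fetch(path)
        loaded.append(path)
    return loaded


# Hypothetical dataset files and a stand-in fetch function.
dataset = ["s3://example-lake/train/part-0", "s3://example-lake/train/part-1"]
cache_contents: Dict[str, bytes] = {}

first_run = preload(cache_contents, dataset, fetch=lambda p: p.encode())
second_run = preload(cache_contents, dataset, fetch=lambda p: p.encode())
```

Running the pre-load twice fetches both files the first time and nothing the second time, so it is safe to leave the step in a pipeline.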

● Reduced Storage Costs & Operational Overhead
○ Avoids full capacity storage purchases by eliminating duplicate datasets from data lakes
■ Data lake (many PBs) vs cache capacity (TB to PB)
○ No coordination required for cache space cleanup
● Easy Expansion & Operation
○ Seamlessly scale architecture to new GPU clusters without complex reconfiguration
○ Fully managed with Kubernetes (K8s) for simplified deployment, scaling, and maintenance across environments
New Architecture: Benefits for Platform Engineers
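
A cache that is much smaller than the data lake needs some automatic eviction so that nobody has to clean up space by hand. The slides do not say which policy is used; the sketch below assumes least-recently-used (LRU) eviction purely for illustration, and the class `LruByteCache` and its byte-sized capacity are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class LruByteCache:
    """Byte-capacity cache with LRU eviction (assumed policy, for illustration)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, path: str) -> Optional[bytes]:
        if path not in self._items:
            return None
        self._items.move_to_end(path)  # mark as most recently used
        return self._items[path]

    def put(self, path: str, data: bytes) -> None:
        if path in self._items:
            self.used_bytes -= len(self._items.pop(path))
        if len(data) > self.capacity_bytes:
            return  # larger than the whole cache: do not cache it
        self._items[path] = data
        self.used_bytes += len(data)
        while self.used_bytes > self.capacity_bytes:
            # Evict the least recently used entry until the data fits.
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)


lru = LruByteCache(capacity_bytes=10)
lru.put("a", b"aaaa")
lru.put("b", b"bbbb")
lru.get("a")           # "a" becomes the most recently used entry
lru.put("c", b"cccc")  # 12 bytes would exceed 10, so "b" is evicted
```

Evicted data is not lost, since the data lake remains the source of truth and a later read simply becomes a cache miss.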

THANK YOU
Copyright © 2024 Coupang, Inc. All rights reserved. All Coupang trademarks, Coupang logos and service marks displayed herein are property of Coupang, Inc. and/or its affiliates (collectively, "Coupang"),
registered in the U.S. and other countries. Any other company mentioned herein is merely for identification purposes. Coupang acknowledges that the company name may be a registered trademark of
the company and recognizes that any such trademark is owned solely and exclusively by such company. The information contained herein are based on the author, Hyun Jung Baek's own individual
experience as an employee and are not representative of any views or opinions of Coupang. Coupang has not verified, and it makes no representation as to, the adequacy, fairness, accuracy, or
completeness of any information contained herein.