Description
Coupang is a leading e-commerce company in South Korea, with over 50,000 employees and $20+ billion in annual revenue. Coupang's AI platform team builds and manages a large-scale AI platform in AWS for machine learning engineers to train models that enhance and customize product search results and product recommendations for its 100+ million customers.
As the search and recommendation models evolve, optimizing the underlying infrastructure for AI/ML workloads is essential for the e-commerce business. Coupang's platform team actively sought to improve their model training pipeline to boost machine learning engineers' productivity, publish models to production faster, and reduce operational costs.
Coupang focused on addressing several key areas:
- Shortening data preparation and model training time
- Improving GPU utilization in training clusters in different regions
- Reducing S3 API and egress costs incurred from copying large training datasets across regions
- Simplifying the operational complexity of storage system management
In this tech talk, Hyun Jung Baek, Staff Backend Engineer at Coupang, will share best practices for leveraging Alluxio to power search and recommendation model training infrastructure.
Hyun will discuss:
- How Coupang builds a world-class large-scale AI platform for machine learning engineers to deliver better search and recommendation models
- How adding distributed caching to their multi-region AI infrastructure improves GPU utilization, accelerates end-to-end training time, and significantly reduces cross-region data transfer costs
- How to simplify platform operations and easily deploy the same architecture to new GPU clusters
Size: 1014.02 KB
Language: en
Added: Apr 22, 2025
Slides: 13 pages
Slide Content
How Coupang Leverages Distributed Cache to Accelerate ML Model Training
April 22, 2025
Hyun Jung Baek, Staff Backend Engineer @ Coupang
Coupang is a technology and Fortune 200 company listed on the New York Stock Exchange (NYSE: CPNG) that provides retail, restaurant delivery, video streaming, and fintech services to customers around the world under brands that include Coupang, Coupang Eats, Coupang Play, and Farfetch.
Coupang is a Technology and Fortune 200 Company (NYSE: CPNG)
Machine Learning Impacts Every Aspect of Commerce Experiences of Coupang Customers
Product Catalog, Search, Pricing, Robotics, Inventory, Fulfilment
Meet Coupang’s Machine Learning Platform: https://medium.com/coupang-engineering/meet-coupangs-machine-learning-platform-cd00e9ccc172
Core offerings
• Notebooks & ML Pipeline Authoring
• Model Training
• Model Inference
• Monitoring & Observability
Coupang’s ML Platform Overview
Meet Coupang’s Machine Learning Platform: https://medium.com/coupang-engineering/meet-coupangs-machine-learning-platform-cd00e9ccc172
Both AWS Multi-Region & On-prem GPU Clusters
● Cloud GPU clusters across AWS Asia-Pacific & US regions
● On-prem data center (compute & storage)
Requirements
● Resource efficiency
○ GPU utilization
● High I/O throughput
● Developer experience
● Cloud cost optimization
Hybrid & Multi-Region Compute & Storage Due to GPU Shortage
Meet Coupang’s Machine Learning Platform: https://medium.com/coupang-engineering/meet-coupangs-machine-learning-platform-cd00e9ccc172
[Chart: Monitoring GPU utilization of the training cluster]
Previous Architecture
[Diagram: Training data is copied from the data lake (ap-region / on-prem) into local storage attached to GPU training clusters in the ap-region and us-region.]
● Required preparation step (copy and validation) before training jobs (sketched below)
○ Added a day-long delay before training on a dataset
● Challenges in fully utilizing GPU resources across regions
○ Difficult to run overflow training jobs in a different region when the local cluster is at peak capacity, as the data may not be available there or may exist under different paths
● Data silos and growing storage costs
● Operational overhead to keep storage organized and under capacity
○ Required coordination across teams to manage and maintain local storage
Challenges of the Previous Architecture
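The preparation step called out above is essentially a bulk cross-region copy followed by a validation pass. A minimal sketch of what such a step can look like is below, written against S3 with boto3; the bucket names and dataset prefix are hypothetical, and the slides do not show Coupang's actual pipeline code.

```python
"""Hypothetical sketch of the old per-region preparation step: copy a dataset
from the central data lake bucket into a region-local bucket, then validate.
Bucket and prefix names are illustrative, not Coupang's actual configuration."""
import boto3

DATALAKE_BUCKET = "central-data-lake"       # assumption: source bucket next to the data lake
LOCAL_BUCKET = "us-region-training-copy"    # assumption: per-region copy for the GPU cluster
PREFIX = "search-ranking/training-set/v1/"  # assumption: dataset prefix

s3 = boto3.client("s3")

def copy_and_validate(prefix: str) -> None:
    copied = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=DATALAKE_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Cross-region copy; for large datasets this is a multi-hour to day-long job.
            s3.copy(
                CopySource={"Bucket": DATALAKE_BUCKET, "Key": obj["Key"]},
                Bucket=LOCAL_BUCKET,
                Key=obj["Key"],
            )
            copied += 1
    # Validation pass: confirm every source object now exists in the regional copy.
    for page in paginator.paginate(Bucket=DATALAKE_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.head_object(Bucket=LOCAL_BUCKET, Key=obj["Key"])
    print(f"copied and validated {copied} objects")

if __name__ == "__main__":
    copy_and_validate(PREFIX)
```

Every new dataset (and every region that needed it) repeated this copy-and-validate cycle, which is where the delay and cross-team coordination came from.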
New Architecture with Distributed Cache
[Diagram: GPU training clusters in the ap-region and us-region each read through a regional distributed cache; the cache reaches back to the data lake (ap-region / on-prem) only on a cache miss.]
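Because every region sees the same logical namespace through its cache, training code can keep referring to the original data-lake address. The sketch below illustrates the idea with an assumed convention that maps an s3:// URI onto the FUSE mount point shown on the next slide; the exact mapping used by Coupang's platform is not specified in the deck.

```python
"""Illustrative only: resolve a data-lake URI to the region-local cache mount.
The mount point and the URI-to-path convention are assumptions."""
from urllib.parse import urlparse

CACHE_MOUNT = "/mnt/cache-fuse"  # assumed FUSE mount point exposed on every GPU node

def resolve(data_lake_uri: str) -> str:
    """Map s3://bucket/key onto the cache mount so the same URI works in any region."""
    parsed = urlparse(data_lake_uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {data_lake_uri!r}")
    return f"{CACHE_MOUNT}/{parsed.netloc}/{parsed.path.lstrip('/')}"

# The same training job can be submitted unmodified in ap-region or us-region;
# reads are served by the regional cache, which hits the data lake only on a miss.
local_path = resolve("s3://data-lake/search-ranking/training-set/part-00000.parquet")
with open(local_path, "rb") as f:  # served from cache when the data is resident
    header = f.read(4)
```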
Inside Distributed Cache
[Diagram: The distributed cache service runs as worker pods plus etcd pods holding the mount table and membership. FUSE pods expose the cache on each node; a training job pod mounts it via hostPath /mnt/cache-fuse and issues I/O requests through the mount, and the cache reads from the data lake only on a cache miss.]
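From the training job's point of view, all that is needed is the node-level FUSE mount. A minimal sketch of what such a pod definition might look like, built with the official Kubernetes Python client, is below; only the hostPath /mnt/cache-fuse comes from the slide, while the image, namespace, names, and command are assumptions.

```python
"""Hypothetical training pod that consumes the cache through the node's FUSE mount.
Only the hostPath /mnt/cache-fuse is taken from the slide; everything else is assumed."""
from kubernetes import client

def training_pod() -> client.V1Pod:
    # hostPath volume pointing at the FUSE mount maintained by the cache's FUSE pods
    cache_volume = client.V1Volume(
        name="cache-fuse",
        host_path=client.V1HostPathVolumeSource(path="/mnt/cache-fuse"),
    )
    container = client.V1Container(
        name="trainer",
        image="registry.example.com/ml/trainer:latest",  # assumed image
        command=["python", "train.py", "--data-root", "/mnt/cache-fuse"],
        volume_mounts=[client.V1VolumeMount(name="cache-fuse", mount_path="/mnt/cache-fuse")],
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name="search-ranking-train", namespace="ml-training"),
        spec=client.V1PodSpec(restart_policy="Never", containers=[container], volumes=[cache_volume]),
    )
```

Submitting the object through CoreV1Api().create_namespaced_pod would schedule it like any other workload; the point is that no storage-specific client code lives in the training image.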
● Instant Data Availability
○ Eliminates lengthy data preparation
■ Training jobs can start immediately without waiting for data to be cached
○ Model developers can still pre-load datasets using the --skip-if-exists flag (see the sketch below)
■ If already cached, this step is a no-op
○ No coordination required across teams, simplifying the workflow
● Improved GPU Utilization Across Regions
○ Maintains a consistent view of all data paths under the original data lake address, enabling seamless access across regions
○ During peak GPU hours, developers can submit training jobs to an overflow GPU cluster unmodified, ensuring higher GPU utilization across multiple regions
● Faster Training Jobs
○ Provides higher performance than traditional HPC storage solutions (e.g., AWS FSx), significantly reducing training time and boosting productivity
New Architecture: Benefits for Model Developers
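The optional pre-load can be thought of as an idempotent cache warm-up: touch every file once through the cache so later training reads are local, and anything already cached costs essentially nothing. The sketch below is a conceptual illustration of that behavior, not the platform's actual pre-load tool or its --skip-if-exists flag; the mount path and dataset prefix are assumptions.

```python
"""Conceptual cache warm-up (not the platform's actual pre-load command).
Reading each file once through the FUSE mount pulls it into the distributed cache;
files that are already cached are served locally, so a repeat run is effectively a no-op."""
import os

CACHE_MOUNT = "/mnt/cache-fuse"  # assumed FUSE mount point on the GPU node

def preload(dataset_prefix: str, chunk_size: int = 64 * 1024 * 1024) -> int:
    """Stream every file under the dataset prefix to warm the cache; returns the file count."""
    root = os.path.join(CACHE_MOUNT, dataset_prefix.strip("/"))
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:
                while f.read(chunk_size):  # discard bytes; the read itself populates the cache
                    pass
            count += 1
    return count

if __name__ == "__main__":
    print(preload("search-ranking/training-set/v1"))  # hypothetical dataset prefix
```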
● Reduced Storage Costs & Operational Overhead
○ Avoids full-capacity storage purchases by eliminating duplicate copies of data lake datasets
■ Data lake (many PBs) vs. cache capacity (TBs to PBs)
○ No coordination required for cache space cleanup
● Easy Expansion & Operation
○ Seamlessly scale the architecture to new GPU clusters without complex reconfiguration (see the sketch below)
○ Fully managed with Kubernetes (K8s) for simplified deployment, scaling, and maintenance across environments
New Architecture: Benefits for Platform Engineers
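Because the cache runs as ordinary Kubernetes workloads, adding capacity for a new or busier GPU cluster is largely a matter of changing replica counts rather than re-architecting storage. Below is a hedged sketch using the official Kubernetes Python client; the StatefulSet name, namespace, and replica count are assumptions about how such a deployment might be laid out, not Coupang's actual setup.

```python
"""Hypothetical capacity bump for the cache worker tier in one GPU cluster.
StatefulSet and namespace names are assumptions, not Coupang's actual deployment."""
from kubernetes import client, config

def scale_cache_workers(replicas: int,
                        name: str = "cache-worker",          # assumed StatefulSet name
                        namespace: str = "distributed-cache") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    apps = client.AppsV1Api()
    # Patch only the replica count; per the architecture slide, workers register
    # their membership via etcd when they start.
    apps.patch_namespaced_stateful_set_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

if __name__ == "__main__":
    scale_cache_workers(replicas=6)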