AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendation Applications

Alluxio 409 views 16 slides Mar 11, 2025

About This Presentation

AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Xu Ning (Director of Engineering, AI Platform @ Snap)

In this talk, Xu Ning from Snap provides a comprehensive overview of the unique challenges in building and scaling recom...


Slide Content

Building Production Platform for Large-Scale Recommendation Applications
Xu Ning

Snap

About Me
●Director of Engineering, ML Platform at Snap
●(prev.) Uber Michelangelo ML Platform, Horovod Project
●(prev.) big data and infrastructure at Uber, Facebook, Akamai, Microsoft Bing

Recommendation application examples
●Search and Ads
●Short Videos
●Feeds

Example architecture of recommendation systems
[Figure: multi-stage recommendation funnel, from “Embedding-based Retrieval with Two-Tower Models in Spotlight”, Snap Eng Blog, 6/6/2023]
●Candidate counts labeled in the figure: 100 millions, thousands, 10s of thousands, hundreds, a pageful
●Techniques labeled in the figure: approximate nearest neighbor search (aka “vector search”); two towers, dot product; Wide-and-deep, DeepFM, DCN, DLRM, Transformers; rule-based; list-wise LTR
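The retrieval step named in the figure labels (two towers, dot product, nearest neighbor search) can be sketched as follows. This is a minimal illustration, not Snap's implementation: the embeddings are random, the function name `top_k_by_dot_product` is hypothetical, and the search is exact brute force standing in for the approximate nearest neighbor index a production system would use.

```python
import numpy as np

def top_k_by_dot_product(user_emb, doc_embs, k):
    """Exact top-k retrieval by dot product (brute-force stand-in for ANN)."""
    scores = doc_embs @ user_emb                 # one score per document
    k = min(k, scores.shape[0])
    # argpartition selects the k best without sorting everything;
    # only those k candidates are then sorted by descending score.
    candidate_idx = np.argpartition(-scores, k - 1)[:k]
    order = np.argsort(-scores[candidate_idx])
    top_idx = candidate_idx[order]
    return top_idx, scores[top_idx]

# Hypothetical data: 10,000 document embeddings and one user embedding,
# as if produced by the document tower and the user tower respectively.
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(10_000, 32)).astype(np.float32)
user_emb = rng.normal(size=32).astype(np.float32)

top_idx, top_scores = top_k_by_dot_product(user_emb, doc_embs, k=100)
```

The later ranking stages would then score only these retrieved candidates with heavier models.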

Example architecture of recommendation systems
“Machine Learning for Snapchat Ad Ranking”, Snap Eng Blog, 2/11/2022
Multiple ranking paths compete at auction

Example recommendation models
1. “Embedding-based Retrieval with Two-Tower Models in Spotlight”, Snap Eng Blog, 6/6/2023
2. “Machine Learning for Snapchat Ad Ranking”, Snap Eng Blog, 2/11/2022
Light ranking “L1” Heavy ranking “L2”

Unique technical challenges in recommendation systems
●Data intensive
●Large model size and freshness
●High fanout inference

RecSys is data intensive
Volume
●DeepSeek V3 was trained with 14.8 trillion tokens, roughly 60 TB
●A recommendation model at Snap is trained with 1 PB of data (and continues to be incrementally trained over time)
●Typically 1-epoch training to prevent overfitting
Variety
●Types: counter, categorical, ID, ID list, embeddings, sequence (array of objects)
●Aggregation dimensions: by entity, by cohort, by category, etc.
Velocity
●Trillions of events processed per day in feature pipelines
●Event to available-for-serving in minutes
“Introducing Bento, Snap's ML Platform”, Snap Eng Blog, 1/28/2025
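The volume comparison can be checked with back-of-the-envelope arithmetic. The sketch below takes the slide's 60 TB and 1 PB figures as given; the token count rounded to 15 trillion and the use of decimal units (1 TB = 10^12 bytes) are assumptions made only for illustration.

```python
TB = 1e12  # decimal terabyte, an assumption
PB = 1e15  # decimal petabyte, an assumption

def implied_bytes_per_token(corpus_bytes, num_tokens):
    """Average bytes of text per token implied by a corpus size."""
    return corpus_bytes / num_tokens

def size_ratio(a_bytes, b_bytes):
    """How many times larger dataset a is than dataset b."""
    return a_bytes / b_bytes

llm_corpus_bytes = 60 * TB      # LLM pre-training corpus size from the slide
recsys_bytes = 1 * PB           # RecSys training data size from the slide
llm_tokens = 15e12              # rounded token count, illustrative assumption

bytes_per_token = implied_bytes_per_token(llm_corpus_bytes, llm_tokens)
ratio = size_ratio(recsys_bytes, llm_corpus_bytes)

print(f"{bytes_per_token:.1f} bytes per token")
print(f"RecSys data is about {ratio:.0f}x the LLM corpus")
```

Under these assumptions the recommendation training set is more than an order of magnitude larger than the LLM corpus, before counting incremental training.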

Example: Snap’s Robusta real-time feature platform
“Speed Up Feature Engineering for Recommendation Systems”, Snap Eng Blog, 9/29/2022

Model size: “Scaling law” before it became a buzzword popularized by LLMs
Meta’s recommendation model, 2024
Meta’s Llama 3.1 405B LLM, 2024
“Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters”, Lian et al, 2021

Training large RecSys models
“Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters”, Lian et al, 2021
“Monolith: Real Time Recommendation System With Collisionless Embedding Table”, Liu et al, 2022
99% of weights
DeepFM
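The “99% of weights” label presumably refers to the embedding tables, which dominate parameter counts in models of this kind. A small parameter count makes the proportion concrete. All table sizes and layer widths below are hypothetical, chosen only to show the order of magnitude.

```python
def embedding_params(num_ids, dim):
    """Parameters in one embedding table: one dim-sized vector per ID."""
    return num_ids * dim

def mlp_params(layer_sizes):
    """Weights plus biases of a fully connected stack."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical sparse features: (number of distinct IDs, embedding dimension).
tables = {
    "user_id": (500_000_000, 64),
    "item_id": (100_000_000, 64),
    "category": (10_000, 16),
}
sparse = sum(embedding_params(n, d) for n, d in tables.values())

# Hypothetical dense tower: concatenated embeddings (64 + 64 + 16 = 144) in, one logit out.
dense = mlp_params([144, 1024, 512, 256, 1])

embedding_share = sparse / (sparse + dense)
print(f"sparse={sparse:,} dense={dense:,} share={embedding_share:.4%}")
```

With tables this size the dense network is a rounding error in the total, which is why training systems for these models treat the sparse and dense parts differently.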

How fresh is fresh enough?
“Monolith: Real Time Recommendation System With Collisionless Embedding Table”, Liu et al, 2022

High Fanout Inference
Compiling model for inference: User feature broadcast
●Train: (user_feature, document_feature) →label
●Inference: user_feature, [(document_feature)]
○Need to broadcast user_feature at model compilation or inference server
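A minimal sketch of the user feature broadcast, assuming NumPy and a toy linear model. The function name `score_candidates`, the feature shapes, and the weights are hypothetical; a real system would perform the equivalent step inside the compiled model graph or the inference server.

```python
import numpy as np

def score_candidates(user_feature, doc_features, weights):
    """Score one user against many documents with a model trained on (user, document) pairs."""
    num_docs = doc_features.shape[0]
    # Broadcast the single user vector across all candidates (a view, no copy yet).
    user_tiled = np.broadcast_to(user_feature, (num_docs, user_feature.shape[0]))
    # Rebuild the (user_feature, document_feature) rows the model saw in training.
    pairs = np.concatenate([user_tiled, doc_features], axis=1)
    logits = pairs @ weights
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid, one score per document

rng = np.random.default_rng(0)
user_feature = rng.normal(size=8)            # fetched once per request
doc_features = rng.normal(size=(1000, 16))   # one row per candidate document
weights = rng.normal(size=24)                # toy model over the concatenated features

scores = score_candidates(user_feature, doc_features, weights)
```

The client sends the user feature once per request rather than once per candidate, which keeps request payloads small at high fanout.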
Document feature fetching
●Each request may need to fetch 10s of thousands of document features
○1TB/s read volume
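One way a read volume of this order can arise is shown below. The request rate, candidates per request, and bytes per document are hypothetical values picked so the product lands at 1 TB/s; they are not figures from the talk.

```python
def feature_read_bytes_per_second(requests_per_second, docs_per_request, bytes_per_doc):
    """Aggregate document-feature read volume for a high-fanout service."""
    return requests_per_second * docs_per_request * bytes_per_doc

# Hypothetical workload, for illustration only.
requests_per_second = 10_000
docs_per_request = 20_000        # "10s of thousands" of candidates per request
bytes_per_doc = 5_000            # assumed serialized feature size per document

volume = feature_read_bytes_per_second(requests_per_second, docs_per_request, bytes_per_doc)
volume_tb_per_s = volume / 1e12  # decimal terabytes per second
print(f"{volume_tb_per_s:.1f} TB/s")
```

Because the three factors multiply, reducing any one of them (fewer candidates, smaller features, caching hot documents) lowers the read volume proportionally.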
Externalized Embedding serving
●1 TB model: cannot fit in memory
●In-memory database / serving parameter server
“Introducing Bento, Snap's ML Platform”, Snap Eng Blog, 1/28/2025

Inference and online feature fetching for RecSys
“Introducing Bento, Snap's ML Platform”, Snap Eng Blog, 1/28/2025

Closing words
●Recommendation systems have unique platform technology and operational challenges due to scale and complexity.
●They are highly customized, and there is no clear cloud or open-source out-of-the-box (OOTB) solution at scale.
○Kuaishou Persia (unmaintained), ByteDance Monolith (unmaintained)
○Very challenging to adopt
●More on how Snap powers its recommendation applications:
https://eng.snap.com/introducing-bento

Snap ML Platform is hiring!
●Senior Principal Machine Learning Engineer, ML Platform
●Principal Machine Learning Engineer, ML Training Platform
●Principal Machine Learning Engineer, ML Inference Platform
●Principal Software Engineer, Machine Learning Infrastructure
●Manager, Software Engineering, Machine Learning Infrastructure, AI Training Platform
●Manager, Software Engineering, Full Stack
●Machine Learning Engineer, 5+ Years Experience
●Machine Learning Engineer, 3+ Years of Experience
●Staff Machine Learning Engineer, 8+ Years of Experience
●Staff Software Engineer, ML Infrastructure, 9+ Years of Experience
●Software Engineer, ML Infrastructure, 6+ Years of Experience
●Software Engineer, ML Infrastructure, 2+ Years of Experience

https://careers.snap.com/