AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendation Applications
Alluxio
About This Presentation
AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Xu Ning (Director of Engineering, AI Platform @ Snap)
In this talk, Xu Ning from Snap provides a comprehensive overview of the unique challenges in building and scaling recommendation systems compared to LLM applications.
Size: 1.18 MB
Language: en
Added: Mar 11, 2025
Slides: 16 pages
Slide Content
Building Production Platform for Large-Scale Recommendation Applications
Xu Ning
Snap
About Me
●Director of Engineering, ML Platform at Snap
●(prev.) Uber Michelangelo ML Platform, Horovod Project
●(prev.) big data and infrastructure at Uber, Facebook, Akamai, Microsoft Bing
Recommendation application examples
Search and Ads, Short Videos, Feeds
Example architecture of recommendation systems
“Embedding-based Retrieval with Two-Tower Models in Spotlight”, Snap Eng Blog, 6/6/2023
[Figure: multi-stage recommendation funnel. Candidate counts labeled: 100 millions, thousands, 10s of thousands, hundreds, a pageful. Techniques labeled: approximate nearest neighbor search (aka “vector search”); two towers, dot product; Wide-and-deep, DeepFM, DCN, DLRM, Transformers; rule-based; list-wise LTR.]
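The retrieval labels in the figure (two towers, dot product, nearest neighbor search) can be illustrated with a minimal sketch. It assumes the user and document embeddings have already been produced by the two towers, and it uses an exhaustive search where a production system would use an approximate nearest neighbor index. The function names, embedding dimension, and corpus size are hypothetical and chosen only for illustration.

```python
import heapq
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(user_emb, doc_embs, k):
    """Return indices of the k documents with the highest dot-product score.

    This is an exact, exhaustive search; an ANN index would approximate
    the same result at much lower cost on a large corpus.
    """
    scored = ((dot(user_emb, d), i) for i, d in enumerate(doc_embs))
    return [i for _, i in heapq.nlargest(k, scored)]


random.seed(0)
DIM = 16  # hypothetical embedding size
doc_embs = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5_000)]
user_emb = [random.gauss(0, 1) for _ in range(DIM)]

# Narrow the corpus down to 100 candidates for the ranking stages.
candidates = retrieve_top_k(user_emb, doc_embs, k=100)
```

The returned indices are ordered from highest to lowest score, so later ranking stages can truncate the list further if needed.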
Example architecture of recommendation systems
“Machine Learning for Snapchat Ad Ranking”, Snap Eng Blog, 2/11/2022
Multiple ranking paths compete at auction
Example recommendation models
1. “Embedding-based Retrieval with Two-Tower Models in Spotlight”, Snap Eng Blog, 6/6/2023
2. “Machine Learning for Snapchat Ad Ranking”, Snap Eng Blog, 2/11/2022
Light ranking (“L1”), heavy ranking (“L2”)
Unique technical challenges in recommendation systems
●Data intensive
●Large model size and freshness
●High fanout inference
Volume
●DeepSeek V3 was trained on 14.8 trillion tokens ≈ 60 TB
●A recommendation model at Snap is trained on 1 PB of data (and continues to be incrementally trained over time)
●Typically 1-epoch training to prevent overfitting
Variety
●Types: counter, categorical, ID, ID list, embeddings, sequence (array of objects)
●Aggregation dimensions: by entity, by cohort, by category, etc.
Velocity
●Trillions of events processed per day in feature pipelines
●Event → available for serving within minutes
RecSys is data intensive
“Introducing Bento, Snap's ML Platform”, Snap Eng Blog, 1/28/2025
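To put the volume and velocity figures on this slide side by side, here is a rough back-of-envelope sketch. It uses only the sizes quoted above (about 60 TB and 1 PB) and treats "trillions of events per day" as at least one trillion, which is an assumption made to get a lower bound. Decimal units (1 TB = 10^12 bytes) are also an assumption.

```python
TB = 10**12  # decimal terabyte, assumed
PB = 10**15  # decimal petabyte, assumed
SECONDS_PER_DAY = 24 * 60 * 60


def size_ratio(recsys_bytes, llm_bytes):
    """How many times larger the recommendation training set is."""
    return recsys_bytes / llm_bytes


def events_per_second(events_per_day):
    """Average event rate implied by a daily event count."""
    return events_per_day / SECONDS_PER_DAY


# 1 PB of recommendation training data vs. roughly 60 TB of LLM tokens.
ratio = size_ratio(1 * PB, 60 * TB)

# Lower bound for "trillions of events per day": one trillion.
rate = events_per_second(10**12)

print(f"training data ratio: about {ratio:.1f}x")
print(f"average event rate: at least {rate / 1e6:.1f} million events/s")
```

Under these assumptions the recommendation data set is roughly 17 times larger, and the feature pipelines sustain more than ten million events per second on average.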
Example: Snap’s Robusta real-time feature platform
“Speed Up Feature Engineering for Recommendation Systems”, Snap Eng Blog, 9/29/2022
Model size: “Scaling law” before it became a buzzword popularized by LLMs
Meta’s recommendation model, 2024
Meta’s Llama 3.1 405B LLM, 2024
“Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters”, Lian et al, 2021
Training large RecSys models
“Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters”, Lian et al, 2021
“Monolith: Real Time Recommendation System With Collisionless Embedding Table”, Liu et al, 2022
[Figure annotations: DeepFM; 99% of weights]
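One plausible reading of the "99% of weights" annotation is that embedding tables dominate the parameter count of models such as DeepFM; that reading is an assumption, since the figure itself is not reproduced here. The sketch below counts parameters for a made-up configuration to show how this can happen. The vocabulary sizes, embedding dimension, and MLP layer sizes are hypothetical.

```python
def embedding_params(vocab_sizes, dim):
    """Total parameters across embedding tables: one dim-sized row per ID."""
    return sum(v * dim for v in vocab_sizes)


def mlp_params(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical ID features: users, documents, categories.
vocab_sizes = [100_000_000, 10_000_000, 1_000_000]
dim = 64

emb = embedding_params(vocab_sizes, dim)
# Dense part: concatenated embeddings feed a small MLP (sizes are made up).
dense = mlp_params([len(vocab_sizes) * dim, 1024, 512, 1])

share = emb / (emb + dense)
print(f"embedding parameters: {emb:,}")
print(f"dense parameters:     {dense:,}")
print(f"embedding share:      {share:.4%}")
```

With ID vocabularies in the tens or hundreds of millions, the dense layers become a rounding error in the total, which is why training systems for these models focus on sharding and updating the embedding tables.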
How fresh is fresh enough?
“Monolith: Real Time Recommendation System With Collisionless Embedding Table”, Liu et al, 2022
High Fanout Inference
Compiling model for inference: User feature broadcast
●Train: (user_feature, document_feature) →label
●Inference: user_feature, [(document_feature)]
○Need to broadcast user_feature to every candidate document, either at model compilation time or in the inference server
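The difference between the training and inference input shapes can be sketched as follows. This is a minimal illustration that assumes the broadcast happens in the serving code; the helper names and the dot-product stand-in for the model are hypothetical and not taken from the talk.

```python
def build_inference_batch(user_feature, document_features):
    """Pair a single user's features with every candidate document.

    The user feature is sent once per request and shared by reference,
    rather than being copied into each (user, document) row.
    """
    return [(user_feature, doc) for doc in document_features]


def score(user_feature, document_feature):
    """Hypothetical stand-in for the trained model: a plain dot product."""
    return sum(u * d for u, d in zip(user_feature, document_feature))


user = [0.5, -1.0]
docs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

batch = build_inference_batch(user, docs)
scores = [score(u, d) for u, d in batch]
print(scores)
```

The model still sees the (user_feature, document_feature) pairs it was trained on, but the request only carries the user feature once.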
Document feature fetching
●Each request may need to fetch tens of thousands of document features
○1 TB/s read volume
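A quick calculation shows how a read volume of this magnitude can arise. The request rate and per-document payload below are hypothetical values chosen only to show one combination that reaches 1 TB/s; the talk gives neither number.

```python
def read_bytes_per_second(requests_per_second, docs_per_request, bytes_per_doc):
    """Aggregate feature read volume across all inference requests."""
    return requests_per_second * docs_per_request * bytes_per_doc


volume = read_bytes_per_second(
    requests_per_second=100_000,  # hypothetical request rate
    docs_per_request=10_000,      # low end of "tens of thousands"
    bytes_per_doc=1_000,          # hypothetical 1 KB of features per document
)
print(f"{volume / 10**12:.1f} TB/s")
```

Because the three factors multiply, even a modest per-document payload turns into a very large aggregate read rate at high fanout.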
Externalized Embedding serving
●1 TB model: cannot fit in memory
●In-memory database / serving parameter server
“Introducing Bento, Snap's ML Platform”, Snap Eng Blog, 1/28/2025
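The idea of externalized embedding serving can be sketched as a key-value store partitioned across shards, so that no single process has to hold the whole table. This is a toy, in-process illustration under stated assumptions: the class name, the CRC32-based sharding, and the key format are all hypothetical, and a real deployment would place each shard on a separate server.

```python
import zlib


class ShardedEmbeddingStore:
    """Toy in-process stand-in for an embedding parameter server."""

    def __init__(self, num_shards):
        # Each dict represents the memory of one (hypothetical) server.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # Stable hash so the same key always maps to the same shard.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, vector):
        self._shard_for(key)[key] = vector

    def get_many(self, keys, default=None):
        """Batched lookup, as one inference request would issue."""
        return [self._shard_for(k).get(k, default) for k in keys]


store = ShardedEmbeddingStore(num_shards=4)
for i in range(1000):
    store.put(f"doc:{i}", [float(i), float(i) + 0.5])

vectors = store.get_many(["doc:7", "doc:999", "doc:missing"])
print(vectors)
```

Missing keys return the default value, which lets the caller substitute a fallback embedding for unseen IDs.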
Inference and online feature fetching for RecSys
“Introducing Bento, Snap's ML Platform”, Snap Eng Blog, 1/28/2025
Closing words
●Recommendation systems pose unique platform technology and operational challenges due to their scale and complexity.
●They are highly customized, and there is no clear cloud or open-source out-of-the-box (OOTB) solution at scale.
○Kuaishou Persia (unmaintained), ByteDance Monolith (unmaintained)
○Very challenging to adopt
●More on how Snap powers its recommendation applications:
https://eng.snap.com/introducing-bento
Snap ML Platform is hiring!
●Senior Principal Machine Learning Engineer, ML Platform
●Principal Machine Learning Engineer, ML Training Platform
●Principal Machine Learning Engineer, ML Inference Platform
●Principal Software Engineer, Machine Learning Infrastructure
●Manager, Software Engineering, Machine Learning Infrastructure, AI Training Platform
●Manager, Software Engineering, Full Stack
●Machine Learning Engineer, 5+ Years Experience
●Machine Learning Engineer, 3+ Years of Experience
●Staff Machine Learning Engineer, 8+ Years of Experience
●Staff Software Engineer, ML Infrastructure, 9+ Years of Experience
●Software Engineer, ML Infrastructure, 6+ Years of Experience
●Software Engineer, ML Infrastructure, 2+ Years of Experience