Building Low Latency ML Systems for Real-Time Model Predictions at Xandr
ScyllaDB
About This Presentation
Xandr's Ad-server handles over 400 billion daily ad requests from across the world wide web. Operating under a stringent Service Level Agreement (SLA), the majority of these requests are catered to within a 100-150 millisecond round-trip latency through an intricate ad auction process, each involving hundreds of competing advertisers. Key stages in this process, such as audience targeting, optimization of advertiser objectives, and ad selection are executed utilizing an assortment of sophisticated ML algorithms. Inferencing ML models in real-time and rendering predictions at such an unparalleled scale under the precise SLAs of an ad auction necessitates a resilient and prompt machine learning system.
In this session, I will discuss the challenges of building such a machine learning system that is characterized by low latency to support the high volume and high throughput demands of ad serving. I will cover how Xandr built an extensible, scalable system to supply real-time predictions integral to the ad auction process, leveraging ML models trained frequently on large amounts of constantly updating ad transaction data. I will also share the lessons learned from building such systems, including how to optimize performance, reduce latency, and ensure reliability.
Size: 4.25 MB
Language: en
Added: Jun 25, 2024
Slides: 47 pages
Slide Content
Building Low Latency ML Systems to serve Real-Time Model Predictions at Scale for Online Advertising
Chinmay Nerurkar, Principal Engineer, Microsoft
Dr. Moussa Taifi, Senior Software Engineer, Microsoft
Act 1 – AdTech and AI/ML, a typical relationship: AdTech can't live without ML/AI, can't live with the latency
What does Xandr do?
- Xandr provides a premium advertising experience to internet users across digital, mobile and CTV media, globally
- Demand Side Platform – enable advertisers to serve relevant Ads to their target users and achieve their advertising goals (RoAS)
- Supply Side Platform – allow publishers to supply high-quality inventory where users are served relevant Ads
- Provide internet users with a seamless, privacy-compliant, premium online Ad experience
How is an Ad served?
1. User visits the ESPN website
2. Browser sends an Ad request to the Ad Server
3. Ad Server collects contextual and audience data and solicits bids from bidding engines
4. Early-stage ranking narrows down Ad candidates based on targeting and audience params
5. Final-stage ranking calculates bid values for Ads
6. Appropriate Ad creatives are selected for each Ad candidate
7. Ad Server runs an auction and the winning Ad is served to the user on the ESPN site
Where does AI/ML come in?
What's the Big Deal?
- Distributed Ad Serving system with 1000+ instances of Bidding Engines and Ad Servers running globally
- Most Ad Auctions conducted in under 150 milliseconds
- Ad Server sees about 600 Billion ad requests daily
- Ad Server conducts around 400 Billion auctions daily
- 10s of Billions of Ad transactions are done daily
Serving a premium advertising experience globally, at scale, in real-time is serious business, and AI/ML powers every step along the way!
Why is ML latency an issue?
- Slow-loading websites / Apps ruin the browsing experience, frustrate users and may turn them away from the website / App
- Advertisers are unable to reach their target audiences and fail to achieve their advertising goals
- Publishers lose significant revenue due to loss of page-views and ad impressions served
AI/ML is crucial to deliver a premium Ad experience for internet users, performance for advertisers and revenue for publishers. This makes ML latency the archenemy of AdTech!
Act 2 – Common Abstraction Layers for ML Serving Latency
PSA of the day FOCUS ON THE ML SERVING LATENCY FIRST, THAT IS WHAT THE CLIENT SEES FIRST.
We are here
Online vs Offline
Online Prediction Is Where Latency Really Hurts!
- Generating ad recommendations
- Ranking search results
- Predicting if a critical piece of equipment will fail in the next few seconds
- Predicting the grocery delivery time based on contextual information
Synchronous Online Predictions == Pain ahead
Serving vs Constructing Predictions
- Prediction Construction – This is where you reduce the time it takes a model to construct predictions from a fully formed, well-behaving, enriched and massaged prediction request.
- Prediction Serving – This is where the rest of the latency lives. This includes any pre-computing, pre-processing, enriching, massaging of input prediction events as well as any post-processing, caching, and optimizing the delivery of the output predictions.
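A minimal sketch of making this split measurable, assuming a generic pipeline in which the preprocess, model, and postprocess callables are hypothetical stand-ins rather than the talk's actual components:

```python
import time
from contextlib import contextmanager

# Sketch: split total latency into "construction" (the model itself) vs the
# serving stages around it. Stage names and callables are illustrative.

@contextmanager
def timed(stage, timings):
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000.0  # milliseconds

def serve_prediction(raw_request, preprocess, model, postprocess):
    timings = {}
    with timed("pre_processing", timings):    # part of prediction *serving*
        features = preprocess(raw_request)
    with timed("construction", timings):      # the model's own prediction latency
        prediction = model(features)
    with timed("post_processing", timings):   # also part of prediction *serving*
        response = postprocess(prediction)
    return response, timings
```

Tracking the stages separately makes it obvious when serving-side work, not the model, dominates the budget.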
Only So Much Can Be Done To Optimize The Model Itself
Core Model Components: Optimizing vs Satisficing Metrics
Optimizing Metrics
- Model predictive ability: MAP, MRR, Accuracy, Precision, MSE, Log Loss
- Each new high or new low, in the corresponding direction of each metric, is a win in the offline world.
Satisficing Metrics
- Is the model going to fit on my device in terms of storage size?
- Can the model run with the type of CPUs on the device? Does it require GPUs?
- Can the feature preprocessing finish within specific time bounds?
- Does the model prediction satisfy the latency limits that our use case requires?
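One way to operationalize this split, as a sketch: treat satisficing metrics as hard constraints and the optimizing metric as the tie-breaker. The candidate list, field names, and thresholds below are illustrative assumptions, not values from the talk.

```python
# Sketch: pick the best model by an optimizing metric subject to satisficing
# constraints. `candidates` is a hypothetical list of dicts with metrics that
# were measured offline; the thresholds are placeholders.
MAX_SIZE_MB = 200
MAX_P99_MS = 10

def pick_model(candidates):
    eligible = [
        m for m in candidates
        if m["size_mb"] <= MAX_SIZE_MB and m["p99_latency_ms"] <= MAX_P99_MS
    ]
    # Among models that satisfy every constraint, maximize the optimizing metric.
    return max(eligible, key=lambda m: m["map_score"]) if eligible else None
```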
PSA #2: Only So Much Can Be Done To Optimize The Model Itself Realize that a 10% improvement in the latency of the model construction will get crushed by distributed I/O
We are here
Features Are The Real Drag In This Whole Operation
Fetching and processing the features is a high-stakes operation, and it is where the bulk of the prediction serving latency comes from.
3 Types of Model Inputs
- User-supplied features: These come directly from the request.
- Static reference features: These are infrequently updated values.
- Dynamic real-time features: These values come from other data streams. They are processed and made available continuously as new contextual data arrives.
Increasing level of complexity! Choose wisely. Start small. Measure the latency, iterate.
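A small sketch of how the three input types might be assembled into one feature set. The request fields, `reference_store`, and `realtime_store` are hypothetical stand-ins (e.g. an in-memory dict and a low-latency key-value lookup), not the actual Xandr pipeline.

```python
# Sketch: merge user-supplied, static reference, and dynamic real-time features.
def build_features(request, reference_store, realtime_store):
    features = {}
    # 1. User-supplied features: taken directly from the incoming request.
    features.update({"device_type": request["device_type"],
                     "country": request["country"]})
    # 2. Static reference features: infrequently updated, safe to cache in memory.
    features.update(reference_store.get(request["publisher_id"], {}))
    # 3. Dynamic real-time features: continuously refreshed from other data streams.
    features.update(realtime_store.get(request["user_id"], {}))
    return features
```

Each step up this list adds latency and operational complexity, which is why the slide advises starting small and measuring.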
We are here
Don’t Forget The Predictions! i.e. Model Outputs
PSA#3: “If all the techniques covered so far still do not make your prediction latency low enough, then the next optimizations you need are precomputing and caching predictions.”
Precomputing predictions? But what about the lookup keys?
- Entity case: the prediction service receives a known entity ID. That ID represents a domain entity.
- Combination of feature values case: We might only receive a combination of location, current shopping cart size, segment information, and product category.
Remember That Predicting On Combinations of Feature Values Gets Expensive Quickly
Example: country, device_type, and song_category
- Generate hash(country, device_type, song_category) as the key.
- The order is important here: hash(country, device_type, song_category) will differ from hash(song_category, country, device_type). Pick a particular feature order and stick with it.
- If you serve in 10 countries, 2 device types, and 40 song categories, then that would mean you make 10x2x40 = 800 cached predictions.
Remember That Predicting On Combinations of Feature Values Gets Expensive Quickly (cont.)
- After deciding on the key, you precompute predictions for each key.
- Store each key-value pair in a low-read-latency DB, and you are good to go.
- Even with a solid key-value store, you still need to reduce the number of predictions stored by reducing the number of possible keys.
- Use the optimizing vs. satisficing metric method here as well: keep adding categories, features, and keys while the model's predictive performance increases, but stop when the prediction latency starts to complain.
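A sketch of the key construction and precompute loop under these assumptions: `model` is a hypothetical trained predictor, `cache` stands in for the low-read-latency store (a plain dict here), and the category lists are trimmed placeholders for the 10x2x40 example.

```python
import hashlib
from itertools import product

COUNTRIES = ["US", "DE", "JP"]           # 10 countries in the slide's example
DEVICE_TYPES = ["mobile", "desktop"]
SONG_CATEGORIES = ["rock", "jazz"]       # 40 categories in the slide's example

FEATURE_ORDER = ("country", "device_type", "song_category")  # fixed, agreed order

def cache_key(country, device_type, song_category):
    # Always hash features in the same order, or identical inputs map to different keys.
    raw = "|".join([country, device_type, song_category])
    return hashlib.sha1(raw.encode()).hexdigest()

def precompute(model, cache):
    # One cached prediction per combination of feature values.
    for country, device, category in product(COUNTRIES, DEVICE_TYPES, SONG_CATEGORIES):
        key = cache_key(country, device, category)
        cache[key] = model.predict({"country": country,
                                    "device_type": device,
                                    "song_category": category})
```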
We are here
Caching Predictions: The Very Special Case Of Real-time Similarity Matching
1. Train a model on the products' similarity using product-user interactions or product-product co-location.
2. Extract the embeddings of the products.
3. Build an index of the embeddings using an approximate nearest neighbor method.
4. Load the index in the ML prediction service.
5. Use the index at prediction time to retrieve the similar product IDs.
6. Periodically update the index to keep things fresh and relevant.
Reference Libraries: Annoy, ScaNN, etc.
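A minimal sketch of steps 3-5 using Annoy, one of the referenced libraries. The random vectors are placeholders standing in for embeddings extracted from a trained similarity model; dimensions and tree count are illustrative.

```python
from annoy import AnnoyIndex  # pip install annoy
import numpy as np

EMBEDDING_DIM = 64
# Placeholder embeddings keyed by product ID (step 2 would produce these).
product_embeddings = {pid: np.random.rand(EMBEDDING_DIM).tolist()
                      for pid in range(1000)}

# Step 3: build the approximate nearest neighbor index.
index = AnnoyIndex(EMBEDDING_DIM, "angular")
for product_id, vector in product_embeddings.items():
    index.add_item(product_id, vector)
index.build(10)  # more trees: better recall, bigger index

# Step 5: at prediction time, retrieve the 20 most similar product IDs.
query_embedding = product_embeddings[42]
similar_ids = index.get_nns_by_vector(query_embedding, 20)
```

In a service, the built index would be saved, loaded at startup (step 4), and rebuilt periodically (step 6).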
Caching? "But what do I do if the index is too large and the prediction latency is too high?" Reduce the embedding sizes to get a smaller index until the optimizing metric starts complaining. If you can't get an acceptable optimizing + satisficing tradeoff, look elsewhere.
Tricks for low-latency caching. Four things to keep in mind:
1. The DB will have lots of rows, but only a few columns. Choose a DB that handles single-key lookups well.
2. Keep an eye on the categories' cardinality and the number of keys generated. Monitor the cardinality and raise alarms if you get a spike in new categories. That will prevent blowing up the DB lookup latency.
3. Continuous values are going to need to be bucketized. That's going to be a hyper-parameter that you need to tune.
4. Any technique that can be used to lower the cardinality of categories is your friend. Lower the cardinality as much as your optimizing metric allows.
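A sketch of the bucketization point (item 3): turning a continuous value into a small set of categories before it becomes part of a cache key. The feature name and bin edges are illustrative assumptions.

```python
import numpy as np

# Bin edges are a hyper-parameter to tune; these values are placeholders.
CART_SIZE_BINS = np.array([0, 1, 3, 5, 10, 25])   # e.g. shopping-cart size

def bucketize_cart_size(cart_size):
    # np.digitize returns the index of the bin the value falls into, turning an
    # unbounded continuous value into a fixed, low-cardinality category.
    return int(np.digitize(cart_size, CART_SIZE_BINS))

assert bucketize_cart_size(3) == bucketize_cart_size(4)   # both land in the 3-5 bucket
```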
We are here
Act 3 – AdTech ML Latency Cases
ML Models for Ads Ranking & Bidding
Case 1 – Early-Stage Ads Ranking
Goal: Shortlist top Ad candidates from a large universe of Ads, given an Ad request, for Final-Stage Ads Ranking and the Ad auction
Model: Two Tower DNN / GNN model to build embeddings for Ads and the context of the Ad request
- Trained on Ads metadata and contextual features in Ad requests
- Hidden layers capture interactions between input features, which yields superior performance compared to conventional models
- Computationally expensive to build
- Model updated infrequently, every few days, for new Ads & contextual features
GNNs & Similarity Search – Naïve Approach
Steps to predict matching Ads given an Ad Request:
1. Extract contextual features from the Ad Request and construct contextual embedding vectors
2. Perform an exact similarity search for matching Ad embeddings in the embedding storage (model store)
3. Look up Ads for the matching Ad embeddings and send them to Bidding engines for Final-stage Ads ranking
- Best results, with 100% recall
- Creating contextual embeddings in real time, Ads lookup, and network latency between multiple systems lead to high prediction serving costs
- Exact similarity search is expensive and slow, leading to high prediction construction costs
GNNs & Similarity Search – ANN
Improvement #1 – Approximate Nearest Neighbor Search
- Approximate results, instead of exact, by limiting the similarity search to a small, local embedding neighborhood
- Modern Vector DBs support ANN using proven algorithms like LSH and HNSW, and libraries like FAISS
- Fast searches lower prediction construction costs
- Very high recall with the right ANN search algorithm
- Prediction serving costs are still high at AdTech scale
GNNs & Similarity Search – ANN + Caching
Improvement #2 – Fast Cache of matching Ads for a given Ad Request
- DNN / GNN models are updated infrequently, so contextual & Ad embeddings remain unchanged
- Extracting contextual features and building an embedding for each Ad request consumes time
- Repeated ANN searches for the same set of contextual features are also wasteful
- Precompute and cache mappings from contextual feature sets to matching Ads to minimize total model serving costs
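A sketch of improvement #2 under stated assumptions: `embed_context` stands in for the context tower, `ann_index` for the ANN index from the previous slide, and the dict cache for a production key-value store. Names and feature order are illustrative.

```python
import hashlib

CONTEXT_FEATURE_ORDER = ("country", "device_type", "publisher_id")
ad_cache = {}  # must be flushed whenever the model (and its embeddings) is retrained

def context_key(context):
    raw = "|".join(str(context[f]) for f in CONTEXT_FEATURE_ORDER)
    return hashlib.sha1(raw.encode()).hexdigest()

def matching_ads(context, embed_context, ann_index, top_k=100):
    key = context_key(context)
    if key not in ad_cache:                      # cache miss: pay embedding + ANN cost once
        embedding = embed_context(context)
        ad_cache[key] = ann_index.get_nns_by_vector(embedding, top_k)
    return ad_cache[key]                         # cache hit: no embedding, no search
```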
Early-Stage Ads Ranking – GNN w. ANN + Caching
Case 2 – Final-Stage Ads Ranking
Goal: Further shortlist Ads eligible to participate in the Ad auction and calculate the corresponding bid price
Model: Predictive models like Logistic Regression used to compute P(Click) & P(Conv) given an Ad Request
- Trained for all Ads in the system using the Ad's historical data
- Served on a smaller set of Ads shortlisted in early-stage Ads Ranking
- Models use 15+ features from an Ad request to compute the probability of a Click or a Conversion, in real-time
- Since multiple Ads will bid in the Ad auction, each Ad request results in multiple model evaluations
Logistic Regression Model for P(click) / P(conv)
Caching is impractical here because:
- High cardinality of input features (permutations of 15+ features)
- Models are updated every few hours, so cached results would have to be updated too
Could be implemented as an ML Service API that Bidding engines call, but then Model Serving Costs >>> Prediction Construction costs
Optimal design choice is to move Model Prediction Construction closer to the bid calculation logic, i.e. inside the Bidding Engines, to minimize Model Serving Costs
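A sketch of why in-process evaluation is cheap: a logistic regression prediction is a dot product plus a sigmoid, so it can run once per candidate Ad inside the bidding engine without a network hop. The weights and feature values below are illustrative, not from the talk.

```python
import math

def p_click(weights, bias, features):
    # P(click) = sigmoid(w . x + b): cheap enough to evaluate per candidate Ad
    # within the auction's latency budget.
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder model, refreshed every few hours from the training pipeline.
weights = {"ad_ctr_7d": 2.1, "position": -0.4, "device_is_mobile": 0.3}
features = {"ad_ctr_7d": 0.012, "position": 1.0, "device_is_mobile": 1.0}
print(p_click(weights, -3.0, features))
```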
Final-Stage Ads Ranking – In-Proc Model Evaluation
Case 3 – Ad Creative Selection
Goal: Select the best Ad Creative for the Ad Request
Model: Reinforcement Learning on historical transactions and Ad Creative metadata
- Served for the small number of Ads selected in Final-stage Ad ranking for bidding into the auction
- Advertisers update Ad creatives and metadata frequently
- Ad creatives vary by geographical location and have different models
- Reinforcement Learning models for Ad Creative Selection are continuously updated based on historical performance
RL Model for Ad Creative Selection
Precomputing and caching model predictions is impractical because:
- The RL model is being continuously updated based on real-time feedback
- Ad Creatives & metadata are changed frequently, requiring model updates
Moving RL model serving into Bidding Engines is non-ideal because:
- Ad Creatives & Metadata can consume significant memory
- Evaluating multiple RL models per Ad request and constantly updating RL models with reinforcement feedback is resource intensive
Optimal Solution – Model Serving API service for Ad Creative Models
- Horizontally scalable to construct predictions fast while handling continuous reinforcement updates to RL models
- Low prediction serving costs for a small set of Ads selected for bidding into the auction
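As an illustration only, here is an epsilon-greedy bandit sketch of the kind of continuously updated selection policy such an API could host; the talk does not specify the exact RL algorithm, and the state, epsilon, and function names are assumptions.

```python
import random
from collections import defaultdict

EPSILON = 0.1                      # exploration rate, illustrative
impressions = defaultdict(int)     # per-creative feedback state, updated continuously
clicks = defaultdict(int)

def select_creative(creative_ids):
    if random.random() < EPSILON:  # explore: try a random creative occasionally
        return random.choice(creative_ids)
    # exploit: pick the creative with the best observed click-through rate so far
    return max(creative_ids,
               key=lambda c: clicks[c] / impressions[c] if impressions[c] else 0.0)

def record_feedback(creative_id, clicked):
    # Reinforcement feedback arriving after the impression is served.
    impressions[creative_id] += 1
    clicks[creative_id] += int(clicked)
```

Keeping this state behind a horizontally scalable API keeps the creative metadata and the continuous updates out of the bidding engines' memory and hot path.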
Ad Creative Selection – Model Serving as an API
Act 4 – Conclusion
Conclusion
- Focus on the ML serving latency first; that is what the client sees first
- Optimizing the model will only take you so far; realize that most of your latency costs arise from real-time model input construction & distributed I/O
- Integrate your model in the hot path early in the modeling lifecycle, then iterate continually to improve latency stats
- Don't discount pre-computing predictions and caching results as an effective model serving technique