Building Low Latency ML Systems for Real-Time Model Predictions at Xandr
ScyllaDB
About This Presentation
Xandr's Ad-server handles over 400 billion daily ad requests from across the world wide web. Operating under a stringent Service Level Agreement (SLA), the majority of these requests are catered to within a 100-150 millisecond round-trip latency through an intricate ad auction process, each involving hundreds of competing advertisers. Key stages in this process, such as audience targeting, optimization of advertiser objectives, and ad selection are executed utilizing an assortment of sophisticated ML algorithms. Inferencing ML models in real-time and rendering predictions at such an unparalleled scale under the precise SLAs of an ad auction necessitates a resilient and prompt machine learning system.
In this session, I will discuss the challenges of building such a machine learning system that is characterized by low latency to support the high volume and high throughput demands of ad serving. I will cover how Xandr built an extensible, scalable system to supply real-time predictions integral to the ad auction process, leveraging ML models trained frequently on large amounts of constantly updating ad transaction data. I will also share the lessons learned from building such systems, including how to optimize performance, reduce latency, and ensure reliability.
Size: 4.25 MB
Language: en
Added: Jun 25, 2024
Slides: 47 pages
Slide Content
Building Low Latency ML Systems to serve Real-Time Model Predictions at Scale for Online Advertising
Chinmay Nerurkar, Principal Engineer, Microsoft
Dr. Moussa Taifi, Senior Software Engineer, Microsoft
Act 1 – AdTech and AI/ML, a typical relationship: AdTech can't live without ML/AI, can't live with the latency
What does Xandr do?
- Xandr provides a premium advertising experience to internet users across digital, mobile and CTV media, globally
- Demand Side Platform – enable advertisers to serve relevant Ads to their target users and achieve their advertising goals (RoAS)
- Supply Side Platform – allow publishers to supply high-quality inventory where users are served relevant Ads
- Provide internet users with a seamless, privacy-compliant, premium online Ad experience
How is an Ad served?
1. User visits the ESPN website
2. Browser sends an Ad request to the Ad Server
3. Ad Server collects contextual and audience data and solicits bids from bidding engines
4. Early-stage ranking narrows down Ad candidates based on targeting and audience params
5. Final-stage ranking calculates bid values for Ads
6. Appropriate Ad creatives are selected for each Ad candidate
7. Ad Server runs an auction and the winning Ad is served to the user on the ESPN site
Where does AI/ML come in?
What's the Big Deal?
- Distributed Ad Serving system with 1000+ instances of Bidding Engines and Ad Servers running globally
- Most Ad Auctions conducted in under 150 milliseconds
- Ad Server sees about 600 Billion ad requests daily
- Ad Server conducts around 400 Billion auctions daily
- 10s of Billions of Ad transactions are done daily
Serving a premium advertising experience globally, at scale, in real-time is serious business, and AI/ML powers every step along the way!
Why is ML latency an issue?
- Slow-loading websites / Apps ruin the browsing experience, frustrate users and may turn them away from the website / App
- Advertisers are unable to reach their target audiences and fail to achieve their advertising goals
- Publishers lose significant revenue due to loss of page-views and ad impressions served
AI/ML is crucial to deliver a premium Ad experience for internet users, performance for advertisers and revenue for publishers. This makes ML latency the archenemy of AdTech!
Act 2 – Common Abstraction Layers for ML Serving Latency
PSA of the day FOCUS ON THE ML SERVING LATENCY FIRST, THAT IS WHAT THE CLIENT SEES FIRST.
We are here
Online vs Offline
Online Prediction Is Where Latency Really Hurts!
- Generating ad recommendations
- Ranking search results
- Predicting if a critical piece of equipment will fail in the next few seconds
- Predicting the grocery delivery time based on contextual information
Synchronous Online Predictions == Pain ahead
Serving vs Constructing Predictions
- Prediction Construction – This is where you reduce the time it takes a model to construct predictions from a fully formed, well-behaving, enriched and massaged prediction request.
- Prediction Serving – This is where the rest of the latency lives. This includes any pre-computing, pre-processing, enriching, massaging of input prediction events as well as any post-processing, caching, and optimizing the delivery of the output predictions.
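A minimal sketch of making this split measurable, assuming a generic pipeline in which the preprocess, model, and postprocess callables are hypothetical stand-ins rather than the talk's actual components:

```python
import time
from contextlib import contextmanager

# Sketch: split total latency into "construction" (the model itself) vs the
# serving stages around it. Stage names and callables are illustrative.

@contextmanager
def timed(stage, timings):
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000.0  # milliseconds

def serve_prediction(raw_request, preprocess, model, postprocess):
    timings = {}
    with timed("pre_processing", timings):    # part of prediction *serving*
        features = preprocess(raw_request)
    with timed("construction", timings):      # the model's own prediction latency
        prediction = model(features)
    with timed("post_processing", timings):   # also part of prediction *serving*
        response = postprocess(prediction)
    return response, timings
```

Tracking the stages separately makes it obvious when serving-side work, not the model, dominates the budget.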
Only So Much Can Be Done To Optimize The Model Itself
Core Model Components: Optimizing vs Satisficing Metrics
Optimizing Metrics
- Model predictive ability: MAP, MRR, Accuracy, Precision, MSE, Log Loss
- Each new high or new low, in the corresponding direction of each metric, is a win in the offline world.
Satisficing Metrics
- Is the model going to fit on my device in terms of storage size?
- Can the model run with the type of CPUs on the device? Does it require GPUs?
- Can the feature preprocessing finish within specific time bounds?
- Does the model prediction satisfy the latency limits that our use case requires?
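One way to operationalize this split, as a sketch: treat satisficing metrics as hard constraints and the optimizing metric as the tie-breaker. The candidate list, field names, and thresholds below are illustrative assumptions, not values from the talk.

```python
# Sketch: pick the best model by an optimizing metric subject to satisficing
# constraints. `candidates` is a hypothetical list of dicts with metrics that
# were measured offline; the thresholds are placeholders.
MAX_SIZE_MB = 200
MAX_P99_MS = 10

def pick_model(candidates):
    eligible = [
        m for m in candidates
        if m["size_mb"] <= MAX_SIZE_MB and m["p99_latency_ms"] <= MAX_P99_MS
    ]
    # Among models that satisfy every constraint, maximize the optimizing metric.
    return max(eligible, key=lambda m: m["map_score"]) if eligible else None
```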
PSA #2: Only So Much Can Be Done To Optimize The Model Itself Realize that a 10% improvement in the latency of the model construction will get crushed by distributed I/O
We are here
Features Are The Real Drag In This Whole Operation
Fetching and processing the features is a high-stakes operation, and it is where the bulk of the prediction serving latency comes from.
3 Types of Model Inputs
- User-supplied features: These come directly from the request.
- Static reference features: These are infrequently updated values.
- Dynamic real-time features: These values come from other data streams. They are processed and made available continuously as new contextual data arrives.
Increasing level of complexity! Choose wisely. Start small. Measure the latency, iterate.
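A small sketch of how the three input types might be assembled into one feature set. The request fields, `reference_store`, and `realtime_store` are hypothetical stand-ins (e.g. an in-memory dict and a low-latency key-value lookup), not the actual Xandr pipeline.

```python
# Sketch: merge user-supplied, static reference, and dynamic real-time features.
def build_features(request, reference_store, realtime_store):
    features = {}
    # 1. User-supplied features: taken directly from the incoming request.
    features.update({"device_type": request["device_type"],
                     "country": request["country"]})
    # 2. Static reference features: infrequently updated, safe to cache in memory.
    features.update(reference_store.get(request["publisher_id"], {}))
    # 3. Dynamic real-time features: continuously refreshed from other data streams.
    features.update(realtime_store.get(request["user_id"], {}))
    return features
```

Each step up this list adds latency and operational complexity, which is why the slide advises starting small and measuring.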
We are here
Don’t Forget The Predictions! i.e. Model Outputs
PSA#3: “If all the techniques covered so far still do not make your prediction latency low enough, then the next optimizations you need are precomputing and caching predictions.”
Precomputing predictions? But what about the lookup keys?
- Entity case: the prediction service receives a known entity ID. That ID represents a domain entity.
- Combination of feature values case: We might only receive a combination of location, current shopping cart size, segment information, and product category.
Remember That Predicting On Combinations of Feature Values Gets Expensive Quickly
Example: country, device_type, and song_category
- Generate hash(country, device_type, song_category) as the key.
- The order is important here: hash(country, device_type, song_category) will differ from hash(song_category, country, device_type). Pick a particular feature order and stick with it.
- If you serve in 10 countries, 2 device types, and 40 song categories, then that would mean you make 10x2x40 = 800 cached predictions.
Remember That Predicting On Combinations of Feature Values Gets Expensive Quickly (cont.)
- After deciding on the key, you precompute predictions for each key.
- Store each key-value pair in a low-read-latency DB, and you are good to go.
- Even with a solid key-value store, you still need to reduce the number of predictions stored by reducing the number of possible keys.
- Use the optimizing vs. satisficing metric method here as well: keep adding categories, features, and keys while the model's predictive performance increases, but stop when the prediction latency starts to complain.
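A sketch of the key construction and precompute loop under these assumptions: `model` is a hypothetical trained predictor, `cache` stands in for the low-read-latency store (a plain dict here), and the category lists are trimmed placeholders for the 10x2x40 example.

```python
import hashlib
from itertools import product

COUNTRIES = ["US", "DE", "JP"]           # 10 countries in the slide's example
DEVICE_TYPES = ["mobile", "desktop"]
SONG_CATEGORIES = ["rock", "jazz"]       # 40 categories in the slide's example

FEATURE_ORDER = ("country", "device_type", "song_category")  # fixed, agreed order

def cache_key(country, device_type, song_category):
    # Always hash features in the same order, or identical inputs map to different keys.
    raw = "|".join([country, device_type, song_category])
    return hashlib.sha1(raw.encode()).hexdigest()

def precompute(model, cache):
    # One cached prediction per combination of feature values.
    for country, device, category in product(COUNTRIES, DEVICE_TYPES, SONG_CATEGORIES):
        key = cache_key(country, device, category)
        cache[key] = model.predict({"country": country,
                                    "device_type": device,
                                    "song_category": category})
```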
We are here
Caching Predictions: The Very Special Case Of Real-time Similarity Matching
1. Train a model on the products' similarity using product-user interactions or product-product co-location.
2. Extract the embeddings of the products.
3. Build an index of the embeddings using an approximate nearest neighbor method.
4. Load the index in the ML prediction service.
5. Use the index at prediction time to retrieve the similar product IDs.
6. Periodically update the index to keep things fresh and relevant.
Reference Libraries: Annoy, ScaNN, etc.
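A minimal sketch of steps 3-5 using Annoy, one of the referenced libraries. The random vectors are placeholders standing in for embeddings extracted from a trained similarity model; dimensions and tree count are illustrative.

```python
from annoy import AnnoyIndex  # pip install annoy
import numpy as np

EMBEDDING_DIM = 64
# Placeholder embeddings keyed by product ID (step 2 would produce these).
product_embeddings = {pid: np.random.rand(EMBEDDING_DIM).tolist()
                      for pid in range(1000)}

# Step 3: build the approximate nearest neighbor index.
index = AnnoyIndex(EMBEDDING_DIM, "angular")
for product_id, vector in product_embeddings.items():
    index.add_item(product_id, vector)
index.build(10)  # more trees: better recall, bigger index

# Step 5: at prediction time, retrieve the 20 most similar product IDs.
query_embedding = product_embeddings[42]
similar_ids = index.get_nns_by_vector(query_embedding, 20)
```

In a service, the built index would be saved, loaded at startup (step 4), and rebuilt periodically (step 6).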
Caching? "But what do I do if the index is too large and the prediction latency is too high?" Reduce the embedding sizes to get a smaller index until the optimizing metric starts complaining. If you can't get an acceptable optimizing + satisficing tradeoff, look elsewhere.
Tricks for low-latency caching. Four things to keep in mind:
1. The DB will have lots of rows, but only a few columns. Choose a DB that handles single-key lookups well.
2. Keep an eye on the categories' cardinality and the number of keys generated. Monitor the cardinality and raise alarms if you get a spike in new categories. That will prevent blowing up the DB lookup latency.
3. Continuous values are going to need to be bucketized. That's going to be a hyper-parameter that you need to tune.
4. Any technique that can be used to lower the cardinality of categories is your friend. Lower the cardinality as much as your optimizing metric allows.
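A sketch of the bucketization point (item 3): turning a continuous value into a small set of categories before it becomes part of a cache key. The feature name and bin edges are illustrative assumptions.

```python
import numpy as np

# Bin edges are a hyper-parameter to tune; these values are placeholders.
CART_SIZE_BINS = np.array([0, 1, 3, 5, 10, 25])   # e.g. shopping-cart size

def bucketize_cart_size(cart_size):
    # np.digitize returns the index of the bin the value falls into, turning an
    # unbounded continuous value into a fixed, low-cardinality category.
    return int(np.digitize(cart_size, CART_SIZE_BINS))

assert bucketize_cart_size(3) == bucketize_cart_size(4)   # both land in the 3-5 bucket
```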
We are here
Act 3 – AdTech ML Latency Cases
ML Models for Ads Ranking & Bidding
Case 1 – Early-Stage Ads Ranking
Goal: Shortlist top Ad candidates from a large universe of Ads, given an Ad request, for Final-Stage Ads Ranking and the Ad auction
Model: Two Tower DNN / GNN model to build embeddings for Ads and the context of the Ad request
- Trained on Ads metadata and contextual features in Ad requests
- Hidden layers capture interactions between input features, which yields superior performance compared to conventional models
- Computationally expensive to build
- Model updated infrequently, every few days, for new Ads & contextual features
GNNs & Similarity Search – Naïve Approach
Steps to predict matching Ads given an Ad Request:
1. Extract contextual features from the Ad Request and construct contextual embedding vectors
2. Perform an exact similarity search for matching Ad embeddings in the embedding storage (model store)
3. Look up Ads for the matching Ad embeddings and send them to Bidding engines for Final-stage Ads ranking
- Best results, with 100% recall
- Creating contextual embeddings in real time, Ads lookup, and network latency between multiple systems lead to high prediction serving costs
- Exact similarity search is expensive and slow, leading to high prediction construction costs
GNNs & Similarity Search – ANN
Improvement #1 – Approximate Nearest Neighbor Search
- Approximate results, instead of exact, by limiting the similarity search to a small, local embedding neighborhood
- Modern Vector DBs support ANN using proven algorithms like LSH and HNSW, and libraries like FAISS
- Fast searches lower prediction construction costs
- Very high recall with the right ANN search algorithm
- Prediction serving costs are still high at AdTech scale
GNNs & Similarity Search – ANN + Caching
Improvement #2 – Fast Cache of matching Ads for a given Ad Request
- DNN / GNN models are updated infrequently, so contextual & Ad embeddings remain unchanged
- Extracting contextual features and building an embedding for each Ad request consumes time
- Repeated ANN searches for the same set of contextual features are also wasteful
- Precompute and cache mappings from contextual feature sets to matching Ads to minimize total model serving costs
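A sketch of improvement #2 under stated assumptions: `embed_context` stands in for the context tower, `ann_index` for the ANN index from the previous slide, and the dict cache for a production key-value store. Names and feature order are illustrative.

```python
import hashlib

CONTEXT_FEATURE_ORDER = ("country", "device_type", "publisher_id")
ad_cache = {}  # must be flushed whenever the model (and its embeddings) is retrained

def context_key(context):
    raw = "|".join(str(context[f]) for f in CONTEXT_FEATURE_ORDER)
    return hashlib.sha1(raw.encode()).hexdigest()

def matching_ads(context, embed_context, ann_index, top_k=100):
    key = context_key(context)
    if key not in ad_cache:                      # cache miss: pay embedding + ANN cost once
        embedding = embed_context(context)
        ad_cache[key] = ann_index.get_nns_by_vector(embedding, top_k)
    return ad_cache[key]                         # cache hit: no embedding, no search
```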
Early-Stage Ads Ranking – GNN w. ANN + Caching
Case 2 – Final-Stage Ads Ranking
Goal: Further shortlist Ads eligible to participate in the Ad auction and calculate the corresponding bid price
Model: Predictive models like Logistic Regression used to compute P(Click) & P(Conv) given an Ad Request
- Trained for all Ads in the system using the Ad's historical data
- Served on a smaller set of Ads shortlisted in early-stage Ads Ranking
- Models use 15+ features from an Ad request to compute the probability of a Click or a Conversion, in real-time
- Since multiple Ads will bid in the Ad auction, each Ad request results in multiple model evaluations
Logistic Regression Model for P(click) / P(conv)
Caching is impractical here because:
- High cardinality of input features (permutations of 15+ features)
- Models are updated every few hours, so cached results would have to be updated too
Could be implemented as an ML Service API that Bidding engines call, but then Model Serving Costs >>> Prediction Construction costs
Optimal design choice is to move Model Prediction Construction closer to the bid calculation logic, i.e. inside the Bidding Engines, to minimize Model Serving Costs
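A sketch of why in-process evaluation is cheap: a logistic regression prediction is a dot product plus a sigmoid, so it can run once per candidate Ad inside the bidding engine without a network hop. The weights and feature values below are illustrative, not from the talk.

```python
import math

def p_click(weights, bias, features):
    # P(click) = sigmoid(w . x + b): cheap enough to evaluate per candidate Ad
    # within the auction's latency budget.
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder model, refreshed every few hours from the training pipeline.
weights = {"ad_ctr_7d": 2.1, "position": -0.4, "device_is_mobile": 0.3}
features = {"ad_ctr_7d": 0.012, "position": 1.0, "device_is_mobile": 1.0}
print(p_click(weights, -3.0, features))
```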
Final-Stage Ads Ranking – In-Proc Model Evaluation
Case 3 – Ad Creative Selection
Goal: Select the best Ad Creative for the Ad Request
Model: Reinforcement Learning on historical transactions and Ad Creative metadata
- Served for the small number of Ads selected in Final-stage Ad ranking for bidding into the auction
- Advertisers update Ad creatives and metadata frequently
- Ad creatives vary by geographical location and have different models
- Reinforcement Learning models for Ad Creative Selection are continuously updated based on historical performance
RL Model for Ad Creative Selection
Precomputing and caching model predictions is impractical because:
- The RL model is being continuously updated based on real-time feedback
- Ad Creatives & metadata are changed frequently, requiring model updates
Moving RL model serving into Bidding Engines is non-ideal because:
- Ad Creatives & Metadata can consume significant memory
- Evaluating multiple RL models per Ad request and constantly updating RL models with reinforcement feedback is resource intensive
Optimal Solution – Model Serving API service for Ad Creative Models
- Horizontally scalable to construct predictions fast while handling continuous reinforcement updates to RL models
- Low prediction serving costs for a small set of Ads selected for bidding into the auction
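As an illustration only, here is an epsilon-greedy bandit sketch of the kind of continuously updated selection policy such an API could host; the talk does not specify the exact RL algorithm, and the state, epsilon, and function names are assumptions.

```python
import random
from collections import defaultdict

EPSILON = 0.1                      # exploration rate, illustrative
impressions = defaultdict(int)     # per-creative feedback state, updated continuously
clicks = defaultdict(int)

def select_creative(creative_ids):
    if random.random() < EPSILON:  # explore: try a random creative occasionally
        return random.choice(creative_ids)
    # exploit: pick the creative with the best observed click-through rate so far
    return max(creative_ids,
               key=lambda c: clicks[c] / impressions[c] if impressions[c] else 0.0)

def record_feedback(creative_id, clicked):
    # Reinforcement feedback arriving after the impression is served.
    impressions[creative_id] += 1
    clicks[creative_id] += int(clicked)
```

Keeping this state behind a horizontally scalable API keeps the creative metadata and the continuous updates out of the bidding engines' memory and hot path.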
Ad Creative Selection – Model Serving as an API
Act 4 – Conclusion
Conclusion
- Focus on the ML serving latency first; that is what the client sees first
- Optimizing the model will only take you so far; realize that most of your latency costs arise from real-time model input construction & distributed I/O
- Integrate your model in the hot path early in the modeling lifecycle, then iterate continually to improve latency stats
- Don't discount pre-computing predictions and caching results as an effective model serving technique