Taming P99 Latencies at Lyft: Tuning Low-Latency Online Feature Stores
ScyllaDB
35 slides
Jun 24, 2024
About This Presentation
In this talk, we will explore the challenges and strategies of tuning low-latency online feature stores to tame p99 latencies, shedding light on the importance of choosing the right data model. As modern machine learning applications require increasingly fast and efficient feature retrieval and computation, addressing p99 latencies has become crucial to maintaining system performance and user satisfaction.
We will begin by defining p99 latencies and their impact on online feature stores. We will then delve into common issues that contribute to high p99 latencies and their implications for system performance. By sharing insights into best practices for latency reduction, we will demonstrate the value of optimizing low-latency online feature stores for better performance.
Central to our discussion will be the significance of selecting the appropriate data model to meet application demands. We will cover various data model alternatives, highlighting their respective strengths and weaknesses.
Our goal is to empower practitioners to make informed decisions and ultimately achieve a well-tuned, efficient feature store capable of taming p99 latencies.
Slide Content
Taming P99 Latencies at Lyft: Tuning Low-Latency Online Feature Stores Bhanu Renukuntla Senior Software Engineer at Lyft
Bhanu Renukuntla (He/His), Senior Software Engineer at Lyft. Works on Feature Store at Lyft. Broadly interested in ML systems. Outside work: travel, hiking, cooking.
Feature Store at Lyft
Use Cases - Applications: Fraud detection, Pricing, Driver Matching, Growth and Incentives, Location Search (context-based location search)
Use Cases - Feature types: Running aggregates (batch & real-time) ➡ int / float; Categorical features ➡ boolean / string; Geotemporal features ➡ geohash x time; Embeddings ➡ vector / list
Feature Store at Lyft - Scale
Features Managed: 2500+
Data Stored: 30 TB (~100B values)
Peak Read Throughput: 100M GETs per min
Write Throughput: 5B updates per day
p50/p95/p99 GET Latency: 8ms/20ms/40ms
Feature Store at Lyft - Architecture: Online Feature Store (architecture diagram built up over several slides)
Performance issues
What are we measuring? GET API - allows fetching a batch of features; Latency - E2E latency observed by the client; p50/p99 GET Latency - 8ms/40ms. GET KEY [feature_1, feature_2 … feature_n]; Example: GET user1 [is_active_rider, has_payment_method, passenger_rating]
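To make the request shape above concrete, here is a minimal Python sketch of a batched feature GET; the `FeatureStoreClient` class, its `get` method, and the in-memory backing store are hypothetical illustrations, not Lyft's actual client API.

```python
# Hypothetical sketch of a batched feature GET -- names and API are assumptions,
# not Lyft's actual client. It only illustrates the request shape:
#   GET KEY [feature_1, feature_2, ..., feature_n]
from typing import Dict, List, Optional


class FeatureStoreClient:
    def __init__(self, store: Dict[str, Dict[str, object]]):
        # `store` stands in for the online feature store backend.
        self._store = store

    def get(self, key: str, features: List[str]) -> Dict[str, Optional[object]]:
        """Fetch a batch of features for one entity key in a single call."""
        row = self._store.get(key, {})
        return {name: row.get(name) for name in features}


if __name__ == "__main__":
    client = FeatureStoreClient(
        {"user1": {"is_active_rider": True, "has_payment_method": True, "passenger_rating": 4.9}}
    )
    # Mirrors the slide's example: GET user1 [is_active_rider, has_payment_method, passenger_rating]
    print(client.get("user1", ["is_active_rider", "has_payment_method", "passenger_rating"]))
```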
Scenario 1: High Batch Sizes - Scatter-gather approach: client-side parallelization + server-side parallelization
Scenario 1: High Batch Sizes - Chart: p99 GET latency (ms) vs batch_size (batch size > 250 vs. batch size < 20)
Scenario 1: High Batch Sizes - Solution: Maintain a high cache hit rate (higher Redis TTL); cache on the client end
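A minimal sketch of the two levers on this slide, assuming redis-py and cachetools; hosts, TTL values, and key names are illustrative assumptions, not Lyft's production configuration.

```python
# Sketch of the two caching levers mentioned on the slide: a longer Redis TTL plus
# a small in-process cache on the client. All values here are illustrative.
import redis
from cachetools import TTLCache

r = redis.Redis(host="localhost", port=6379)

# Server-side cache: write the feature value with a longer TTL so hot keys stay warm.
r.set("feature:user1:passenger_rating", 4.9, ex=600)  # 10-minute TTL (illustrative)

# Client-side cache: avoid a network round trip entirely for recently read keys.
local_cache = TTLCache(maxsize=100_000, ttl=30)  # 30-second client TTL (illustrative)


def get_feature(key: str):
    if key in local_cache:
        return local_cache[key]
    value = r.get(key)          # fall back to Redis on a local miss
    if value is not None:
        local_cache[key] = value
    return value
```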
Scenario 2: Geotemporal Features - Keys accessed change every x minutes; difficult to maintain a high hit rate; spikes in p99 latencies. KEY: Geohash + hour_of_day + minute_of_hour + …
Scenario 2: Geotemporal Features - Solution: Use a coarser time granularity in the key: Geohash & hour of the day -> list [minute-level features]. KEY: Geohash + hour_of_day + minute_of_hour + …
Scenario 2: Geotemporal Features - Chart: p95 GET latency (ms)
Scenario 2: Geotemporal Features - Chart: cache hit rate
Scenario 2: Geotemporal Features - Chart: p50 GET latency, showing a spike every hour
Scenario 2: Geotemporal Features - Chart: p99 GET latency, showing the drop in spikes (current p99 latencies vs. p99 latencies one week ago)
Scenario 2: Geotemporal Features - Solution: Use a coarser time granularity in the key: Geohash & hour of the day -> list [minute-level features]; async prefetch features to work around spikes in p99 latencies. KEY: Geohash + hour_of_day + minute_of_hour + …
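A rough sketch of this solution, assuming the online store is Redis: minute-level values are co-located under a (geohash, hour) key, and the next hour's bucket is prefetched in the background so the hour rollover does not show up as a p99 spike. The key format, field layout, and prefetch mechanism are assumptions for illustration, not Lyft's implementation.

```python
# Sketch of the geotemporal fix: key on (geohash, hour) and keep the minute-level
# values together in one Redis hash, then asynchronously prefetch the next hour's
# key before the hour boundary. Key format and cache shape are assumptions.
import datetime
import threading
import redis

r = redis.Redis(host="localhost", port=6379)


def hour_key(geohash: str, ts: datetime.datetime) -> str:
    # Coarser key: geohash + hour (the minute becomes a field inside the hash).
    return f"geo:{geohash}:{ts.strftime('%Y%m%d%H')}"


def write_minute_feature(geohash: str, ts: datetime.datetime, value: float) -> None:
    r.hset(hour_key(geohash, ts), mapping={str(ts.minute): value})


def read_minute_feature(geohash: str, ts: datetime.datetime):
    return r.hget(hour_key(geohash, ts), str(ts.minute))


def prefetch_next_hour(geohash: str, local_cache: dict) -> None:
    """Warm the next hour's bucket in the background to avoid a spike at rollover."""
    nxt = datetime.datetime.utcnow() + datetime.timedelta(hours=1)

    def _fetch():
        local_cache[hour_key(geohash, nxt)] = r.hgetall(hour_key(geohash, nxt))

    threading.Thread(target=_fetch, daemon=True).start()
```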
Scenario 3: Client side processing overhead - A thick client with client-side parallelization is needed for scatter-gather; it adds processing overhead or p99 latency spikes due to resource contention. Solution: Use more resources OR tune circuit breaker settings.
Scenario 3: Client side processing overhead - Charts: p99 GET latencies (ms) for Client E2E and Service Mesh E2E; client K8s capacity used
Summary of performance issues (tiered storage): Scatter-gather approach is not performant; effective caching is difficult in some cases; thick client is difficult to manage. Impact: high latencies & difficult onboarding.
Path Forward
Redesign - Effective Scatter-Gather: a single low-latency persistent store; co-locate features; lightweight client & offload scatter-gather to the server
Redesign
Redesign - Redis-only mode: enable efficient storage and faster fetching; Redis Hashes - co-locating features; string compression and hashing keys; Redis Pipelines + MultiGet for effective Scatter-Gather
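A minimal sketch of this read path using redis-py, with the features for each entity co-located in one Redis hash and a non-transactional pipeline doing the scatter-gather in a single round trip; the key layout and feature names are illustrative assumptions.

```python
# Sketch of the redesigned read path: features co-located in Redis hashes per
# entity, fetched for many entities at once with a single pipeline.
import redis

r = redis.Redis(host="localhost", port=6379)


def batch_get(entity_ids, feature_names):
    """Server-side scatter-gather: one pipelined round trip instead of N GETs."""
    pipe = r.pipeline(transaction=False)
    for entity_id in entity_ids:
        # All features for an entity live in one hash, so one HMGET returns them together.
        pipe.hmget(f"features:{entity_id}", feature_names)
    results = pipe.execute()
    return {
        entity_id: dict(zip(feature_names, values))
        for entity_id, values in zip(entity_ids, results)
    }


# Example: fetch three co-located features for two riders in one round trip.
# batch_get(["user1", "user2"], ["is_active_rider", "has_payment_method", "passenger_rating"])
```

Pipelining keeps the fan-out on the server side of the network hop, which is what lets the client stay thin, in line with the "lightweight client & offload scatter-gather to the server" point above.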
Acknowledgements Thanks to my colleagues at Lyft for the feedback on this deck. Prem Santosh Udaya Shankar (Manager) Nathanael Ji Maheep Myneni Rohan Varshney Abhinav Agarwal Ravi Kiran Magham Seth Saperstein