Taming P99 Latencies at Lyft: Tuning Low-Latency Online Feature Stores

ScyllaDB · 35 slides · Jun 24, 2024

About This Presentation

In this talk, we explore the challenges and strategies of tuning low-latency online feature stores to tame p99 latencies, shedding light on the importance of choosing the right data model. As modern machine learning applications require increasingly fast and efficient feature retrieval/comp...


Slide Content

Taming P99 Latencies at Lyft: Tuning Low-Latency Online Feature Stores. Bhanu Renukuntla, Senior Software Engineer at Lyft.

Bhanu Renukuntla (he/his), Senior Software Engineer at Lyft. Works on the Feature Store at Lyft; broadly interested in ML systems. Outside work: travel, hiking, cooking.

Feature Store at Lyft

Use Cases - Applications: fraud detection, pricing, driver matching, growth and incentives, and location search (context-based location search).

Use Cases - Feature types: running aggregates (batch & real-time) ➡ int / float; categorical features ➡ boolean / string; geotemporal features ➡ geohash x time; embeddings ➡ vector / list.

Feature Store at Lyft - Scale

Metric                  | Value
Features Managed        | 2500+
Data Stored             | 30 TB (100B values)
Peak Read Throughput    | 100M GETs per minute
Write Throughput        | 5B updates per day
p50/p95/p99 GET Latency | 8 ms / 20 ms / 40 ms

Feature Store at Lyft - Architecture [architecture diagram, built up over several slides to highlight the Online Feature Store]

Performance issues

What are we measuring? GET API: allows fetching a batch of features. Latency: E2E latency observed by the client; p50/p99 GET latency is 8 ms / 40 ms. Request shape: GET KEY [feature_1, feature_2 … feature_n]. Example: GET user1 [is_active_rider, has_payment_method, passenger_rating].
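For illustration only, a minimal sketch of what such a batched GET might look like from the client's side; the FeatureStoreClient class, the in-memory stand-in store, and its values are hypothetical, not Lyft's actual API.

```python
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for the online store, keyed by (entity, feature).
_FAKE_STORE = {
    ("user1", "is_active_rider"): True,
    ("user1", "has_payment_method"): True,
    ("user1", "passenger_rating"): 4.9,
}


class FeatureStoreClient:
    """Illustrative client for the batched GET API: GET KEY [feature_1 ... feature_n]."""

    def get(self, key: str, feature_names: List[str]) -> Dict[str, Any]:
        # Fetch every requested feature for one entity key in a single call.
        return {name: _FAKE_STORE.get((key, name)) for name in feature_names}


client = FeatureStoreClient()
print(client.get("user1", ["is_active_rider", "has_payment_method", "passenger_rating"]))
```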

Scenario 1: High Batch Sizes. Scatter-gather approach: client-side parallelization + server-side parallelization (sketched below).
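A rough sketch of the client-side half of scatter-gather, assuming a hypothetical fetch_shard RPC and an arbitrary shard size; it only illustrates the split-fetch-merge shape, not Lyft's implementation.

```python
# Sketch of client-side scatter-gather: split a large feature batch into
# shards, fetch them concurrently, then merge the partial results.
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List

SHARD_SIZE = 20  # assumed shard size; the slides show small batches behave better


def fetch_shard(key: str, shard: List[str]) -> Dict[str, Any]:
    # Placeholder for one RPC to the online store for a subset of features.
    return {name: None for name in shard}


def scatter_gather_get(key: str, feature_names: List[str]) -> Dict[str, Any]:
    shards = [
        feature_names[i : i + SHARD_SIZE]
        for i in range(0, len(feature_names), SHARD_SIZE)
    ]
    merged: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=len(shards) or 1) as pool:
        for partial in pool.map(lambda s: fetch_shard(key, s), shards):
            merged.update(partial)
    return merged
```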

Scenario 1: High Batch Sizes [chart: p99 GET latency (ms) vs. batch_size, contrasting batch sizes > 250 with batch sizes < 20]

Scenario 1: High Batch Sizes - Solution: maintain a high cache hit rate via a higher Redis TTL and a cache on the client end.
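A minimal sketch of the two levers on this slide, assuming redis-py for the Redis side and a tiny hand-rolled in-process cache; the key name, TTL values, and local-cache policy are illustrative assumptions.

```python
# Sketch of the two caching levers: a longer Redis TTL on write, plus a small
# in-process cache on the client. TTL values and key naming are assumptions.
import time
from typing import Any, Dict, Optional, Tuple

import redis

r = redis.Redis(host="localhost", port=6379)

# Write features with a higher TTL so hot keys stay in Redis longer.
r.set("features:user1:passenger_rating", "4.9", ex=6 * 3600)

# Tiny client-side cache with its own (shorter) TTL.
_local_cache: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
_LOCAL_TTL_S = 60


def cached_get(key: str) -> Optional[Any]:
    entry = _local_cache.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]
    value = r.get(key)
    _local_cache[key] = (time.monotonic() + _LOCAL_TTL_S, value)
    return value
```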

Scenario 2: Geotemporal Features. The keys accessed change every x minutes, making it difficult to maintain a high hit rate and causing spikes in p99 latencies. KEY: geohash + hour_of_day + minute_of_hour + …

Scenario 2: Geotemporal Features - Solution: use a coarser time granularity, mapping geohash & hour of the day -> list of [minute-level features] instead of KEY: geohash + hour_of_day + minute_of_hour + … (see the sketch below).
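A sketch of the key change, under the assumption that the 60 minute-level values are packed into a list stored under an hour-level key; the key formats and helper names are illustrative.

```python
from typing import Any, List


# Before: one key per (geohash, hour, minute) -> one value, so the hot key set
# rotates every minute and the cache hit rate suffers.
def minute_key(geohash: str, hour: int, minute: int) -> str:
    return f"{geohash}:{hour:02d}:{minute:02d}"


# After: one key per (geohash, hour) -> list of minute-level values, so the hot
# key set only rotates once an hour and stays cacheable.
def hour_key(geohash: str, hour: int) -> str:
    return f"{geohash}:{hour:02d}"


def lookup_minute_value(hourly_values: List[Any], minute: int) -> Any:
    # hourly_values is the list stored under hour_key; index by minute.
    return hourly_values[minute]
```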

Scenario 2: Geotemporal Features [chart: p95 GET latency (ms)]

Scenario 2: Geotemporal Features [chart: cache hit rate]

Scenario 2: Geotemporal Features [chart: p50 GET latency, showing a spike every hour]

Scenario 2: Geotemporal Features [chart: p99 GET latency, showing the drop in spikes compared with p99 latencies one week ago]

Scenario 2: Geotemporal Features - Solution: use a coarser time granularity (geohash & hour of the day -> list of [minute-level features], replacing KEY: geohash + hour_of_day + minute_of_hour + …) and asynchronously prefetch features to work around the spikes in p99 latencies (sketched below).
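A rough sketch of the async-prefetch idea: warm the next hour's key shortly before the boundary so the first requests after the rollover do not all miss. The lead time, loop structure, and warm_key helper are assumptions, not Lyft's scheduler.

```python
# Sketch of async prefetching: shortly before the hour boundary, warm the
# cache for the next hour's geotemporal key so the rollover does not cause a
# burst of cache misses. Timings and helper names are illustrative.
import asyncio
from datetime import datetime, timedelta

PREFETCH_LEAD_S = 120  # start warming ~2 minutes before the hour rolls over


async def warm_key(geohash: str, hour: int) -> None:
    # Placeholder: fetch the hour-level feature list and populate the cache.
    await asyncio.sleep(0)


async def prefetch_loop(geohash: str) -> None:
    while True:
        now = datetime.utcnow()
        next_hour = (now + timedelta(hours=1)).replace(minute=0, second=0, microsecond=0)
        # Sleep until PREFETCH_LEAD_S seconds before the next hour boundary.
        await asyncio.sleep(max(0.0, (next_hour - now).total_seconds() - PREFETCH_LEAD_S))
        await warm_key(geohash, next_hour.hour)
        # Step past the boundary before scheduling the following hour.
        await asyncio.sleep(PREFETCH_LEAD_S + 1)
```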

Scenario 3: Client-side processing overhead. A thick client with client-side parallelization is needed for scatter-gather, which adds processing overhead and p99 latency spikes due to resource contention. Solution: use more resources OR tune circuit-breaker settings (a generic sketch follows).
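One generic way to express the circuit-breaker knob on the client is a cap on in-flight shard fetches that sheds load instead of queueing; this is a sketch under that assumption, not Lyft's actual configuration, which in a service-mesh setup would normally be tuned at the mesh layer rather than in application code.

```python
# Generic sketch of a client-side concurrency cap, one of the knobs a
# circuit-breaker-style setting controls. The limit value is an assumption.
import threading
from typing import Any, Dict, List

MAX_INFLIGHT_SHARDS = 8  # illustrative cap on concurrent shard fetches
_inflight = threading.BoundedSemaphore(MAX_INFLIGHT_SHARDS)


def fetch_shard_limited(key: str, shard: List[str]) -> Dict[str, Any]:
    if not _inflight.acquire(blocking=False):
        # Shed load instead of queueing so contention does not balloon p99.
        raise RuntimeError("too many in-flight shard fetches")
    try:
        return {name: None for name in shard}  # placeholder RPC
    finally:
        _inflight.release()
```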

Scenario 3: Client-side processing overhead [charts: p99 GET latencies (ms) for client E2E and service mesh E2E, alongside client K8s capacity used]

Summary of performance issues (tiered storage): the scatter-gather approach is not performant; effective caching is difficult in some cases; the thick client is difficult to manage. Impact: high latencies and difficult onboarding.

Path Forward

Redesign: Effective Scatter-Gather. A single low-latency persistent store; co-locate features; a lightweight client, offloading scatter-gather to the server.

Redesign

Redesign: Redis-only mode. Enable efficient storage and faster fetching: Redis Hashes for co-locating features; string compression and hashing keys; Redis pipelines + MultiGet for effective scatter-gather (see the sketch below).
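A sketch of this layout with redis-py: one hash per entity co-locates its features, long string keys are hashed down to short ones, and a pipeline of HMGET calls does the scatter-gather in one round trip per node. Key formats, field names, and the md5 shortening are illustrative, and compression is omitted.

```python
# Sketch of the redesigned layout with redis-py: one Redis Hash per entity
# co-locates its features, and a pipeline batches many HMGET calls.
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379)


def entity_key(entity_id: str) -> str:
    # Hash long string keys down to a short fixed-size key to save memory.
    return "fs:" + hashlib.md5(entity_id.encode()).hexdigest()[:16]


# Co-locate features for one entity in a single hash.
r.hset(entity_key("user1"), mapping={
    "is_active_rider": 1,
    "has_payment_method": 1,
    "passenger_rating": 4.9,
})

# Batched read: one pipeline, one HMGET per entity, sent in a single round trip.
entities = ["user1", "user2", "user3"]
fields = ["is_active_rider", "passenger_rating"]
pipe = r.pipeline(transaction=False)
for e in entities:
    pipe.hmget(entity_key(e), fields)
results = dict(zip(entities, pipe.execute()))
```

Co-locating an entity's features in one hash means a single HMGET replaces many per-feature GETs, which is what makes offloading the scatter-gather to the server cheap.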

Acknowledgements. Thanks to my colleagues at Lyft for the feedback on this deck: Prem Santosh Udaya Shankar (Manager), Nathanael Ji, Maheep Myneni, Rohan Varshney, Abhinav Agarwal, Ravi Kiran Magham, and Seth Saperstein.