Taming P99 Latencies at Lyft: Tuning Low-Latency Online Feature Stores

ScyllaDB · 35 slides · Jun 24, 2024

About This Presentation

In this talk, we explore the challenges and strategies of tuning low-latency online feature stores to tame p99 latencies, shedding light on the importance of choosing the right data model. As modern machine learning applications require increasingly fast and efficient feature retrieval/comp...


Slide Content

Taming P99 Latencies at Lyft: Tuning Low-Latency Online Feature Stores. Bhanu Renukuntla, Senior Software Engineer at Lyft.

Bhanu Renukuntla (he/his), Senior Software Engineer at Lyft. Works on the Feature Store at Lyft; broadly interested in ML systems. Outside work: travel, hiking, cooking.

Feature Store at Lyft

Use Cases - Applications: fraud detection, pricing, driver matching, growth and incentives, and location search (context-based location search).

Use Cases - Feature types: running aggregates (batch & real-time) ➡ int / float; categorical features ➡ boolean / string; geotemporal features ➡ geohash x time; embeddings ➡ vector / list.

Feature Store at Lyft - Scale

Metric                  | Value
Features Managed        | 2500+
Data Stored             | 30 TB (100B values)
Peak Read Throughput    | 100M GETs per minute
Write Throughput        | 5B updates per day
p50/p95/p99 GET Latency | 8 ms / 20 ms / 40 ms

Feature Store at Lyft - Architecture [architecture diagram, built up over several slides to highlight the Online Feature Store]

Performance issues

What are we measuring? GET API: allows fetching a batch of features. Latency: E2E latency observed by the client; p50/p99 GET latency is 8 ms / 40 ms. Request shape: GET KEY [feature_1, feature_2 … feature_n]. Example: GET user1 [is_active_rider, has_payment_method, passenger_rating].
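For illustration only, a minimal sketch of what such a batched GET might look like from the client's side; the FeatureStoreClient class, the in-memory stand-in store, and its values are hypothetical, not Lyft's actual API.

```python
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for the online store, keyed by (entity, feature).
_FAKE_STORE = {
    ("user1", "is_active_rider"): True,
    ("user1", "has_payment_method"): True,
    ("user1", "passenger_rating"): 4.9,
}


class FeatureStoreClient:
    """Illustrative client for the batched GET API: GET KEY [feature_1 ... feature_n]."""

    def get(self, key: str, feature_names: List[str]) -> Dict[str, Any]:
        # Fetch every requested feature for one entity key in a single call.
        return {name: _FAKE_STORE.get((key, name)) for name in feature_names}


client = FeatureStoreClient()
print(client.get("user1", ["is_active_rider", "has_payment_method", "passenger_rating"]))
```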

Scenario 1: High Batch Sizes. Scatter-gather approach: client-side parallelization + server-side parallelization (sketched below).
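A rough sketch of the client-side half of scatter-gather, assuming a hypothetical fetch_shard RPC and an arbitrary shard size; it only illustrates the split-fetch-merge shape, not Lyft's implementation.

```python
# Sketch of client-side scatter-gather: split a large feature batch into
# shards, fetch them concurrently, then merge the partial results.
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List

SHARD_SIZE = 20  # assumed shard size; the slides show small batches behave better


def fetch_shard(key: str, shard: List[str]) -> Dict[str, Any]:
    # Placeholder for one RPC to the online store for a subset of features.
    return {name: None for name in shard}


def scatter_gather_get(key: str, feature_names: List[str]) -> Dict[str, Any]:
    shards = [
        feature_names[i : i + SHARD_SIZE]
        for i in range(0, len(feature_names), SHARD_SIZE)
    ]
    merged: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=len(shards) or 1) as pool:
        for partial in pool.map(lambda s: fetch_shard(key, s), shards):
            merged.update(partial)
    return merged
```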

Scenario 1: High Batch Sizes [chart: p99 GET latency (ms) vs. batch_size, contrasting batch sizes > 250 with batch sizes < 20]

Scenario 1: High Batch Sizes - Solution: maintain a high cache hit rate via a higher Redis TTL and a cache on the client end.
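A minimal sketch of the two levers on this slide, assuming redis-py for the Redis side and a tiny hand-rolled in-process cache; the key name, TTL values, and local-cache policy are illustrative assumptions.

```python
# Sketch of the two caching levers: a longer Redis TTL on write, plus a small
# in-process cache on the client. TTL values and key naming are assumptions.
import time
from typing import Any, Dict, Optional, Tuple

import redis

r = redis.Redis(host="localhost", port=6379)

# Write features with a higher TTL so hot keys stay in Redis longer.
r.set("features:user1:passenger_rating", "4.9", ex=6 * 3600)

# Tiny client-side cache with its own (shorter) TTL.
_local_cache: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
_LOCAL_TTL_S = 60


def cached_get(key: str) -> Optional[Any]:
    entry = _local_cache.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]
    value = r.get(key)
    _local_cache[key] = (time.monotonic() + _LOCAL_TTL_S, value)
    return value
```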

Scenario 2: Geotemporal Features. The keys accessed change every x minutes, making it difficult to maintain a high hit rate and causing spikes in p99 latencies. KEY: geohash + hour_of_day + minute_of_hour + …

Scenario 2: Geotemporal Features - Solution: use a coarser time granularity, mapping geohash & hour of the day -> list of [minute-level features] instead of KEY: geohash + hour_of_day + minute_of_hour + … (see the sketch below).
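A sketch of the key change, under the assumption that the 60 minute-level values are packed into a list stored under an hour-level key; the key formats and helper names are illustrative.

```python
from typing import Any, List


# Before: one key per (geohash, hour, minute) -> one value, so the hot key set
# rotates every minute and the cache hit rate suffers.
def minute_key(geohash: str, hour: int, minute: int) -> str:
    return f"{geohash}:{hour:02d}:{minute:02d}"


# After: one key per (geohash, hour) -> list of minute-level values, so the hot
# key set only rotates once an hour and stays cacheable.
def hour_key(geohash: str, hour: int) -> str:
    return f"{geohash}:{hour:02d}"


def lookup_minute_value(hourly_values: List[Any], minute: int) -> Any:
    # hourly_values is the list stored under hour_key; index by minute.
    return hourly_values[minute]
```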

Scenario 2: Geotemporal Features [chart: p95 GET latency (ms)]

Scenario 2: Geotemporal Features [chart: cache hit rate]

Scenario 2: Geotemporal Features [chart: p50 GET latency, showing a spike every hour]

Scenario 2: Geotemporal Features [chart: p99 GET latency, showing the drop in spikes compared with p99 latencies one week ago]

Scenario 2: Geotemporal Features - Solution: use a coarser time granularity (geohash & hour of the day -> list of [minute-level features], replacing KEY: geohash + hour_of_day + minute_of_hour + …) and asynchronously prefetch features to work around the spikes in p99 latencies (sketched below).
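A rough sketch of the async-prefetch idea: warm the next hour's key shortly before the boundary so the first requests after the rollover do not all miss. The lead time, loop structure, and warm_key helper are assumptions, not Lyft's scheduler.

```python
# Sketch of async prefetching: shortly before the hour boundary, warm the
# cache for the next hour's geotemporal key so the rollover does not cause a
# burst of cache misses. Timings and helper names are illustrative.
import asyncio
from datetime import datetime, timedelta

PREFETCH_LEAD_S = 120  # start warming ~2 minutes before the hour rolls over


async def warm_key(geohash: str, hour: int) -> None:
    # Placeholder: fetch the hour-level feature list and populate the cache.
    await asyncio.sleep(0)


async def prefetch_loop(geohash: str) -> None:
    while True:
        now = datetime.utcnow()
        next_hour = (now + timedelta(hours=1)).replace(minute=0, second=0, microsecond=0)
        # Sleep until PREFETCH_LEAD_S seconds before the next hour boundary.
        await asyncio.sleep(max(0.0, (next_hour - now).total_seconds() - PREFETCH_LEAD_S))
        await warm_key(geohash, next_hour.hour)
        # Step past the boundary before scheduling the following hour.
        await asyncio.sleep(PREFETCH_LEAD_S + 1)
```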

Scenario 3: Client-side processing overhead. A thick client with client-side parallelization is needed for scatter-gather, which adds processing overhead and p99 latency spikes due to resource contention. Solution: use more resources OR tune circuit-breaker settings (a generic sketch follows).
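One generic way to express the circuit-breaker knob on the client is a cap on in-flight shard fetches that sheds load instead of queueing; this is a sketch under that assumption, not Lyft's actual configuration, which in a service-mesh setup would normally be tuned at the mesh layer rather than in application code.

```python
# Generic sketch of a client-side concurrency cap, one of the knobs a
# circuit-breaker-style setting controls. The limit value is an assumption.
import threading
from typing import Any, Dict, List

MAX_INFLIGHT_SHARDS = 8  # illustrative cap on concurrent shard fetches
_inflight = threading.BoundedSemaphore(MAX_INFLIGHT_SHARDS)


def fetch_shard_limited(key: str, shard: List[str]) -> Dict[str, Any]:
    if not _inflight.acquire(blocking=False):
        # Shed load instead of queueing so contention does not balloon p99.
        raise RuntimeError("too many in-flight shard fetches")
    try:
        return {name: None for name in shard}  # placeholder RPC
    finally:
        _inflight.release()
```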

Scenario 3: Client-side processing overhead [charts: p99 GET latencies (ms) for client E2E and service mesh E2E, alongside client K8s capacity used]

Summary of performance issues (tiered storage): the scatter-gather approach is not performant; effective caching is difficult in some cases; the thick client is difficult to manage. Impact: high latencies and difficult onboarding.

Path Forward

Redesign: Effective Scatter-Gather. A single low-latency persistent store; co-locate features; a lightweight client, offloading scatter-gather to the server.

Redesign

Redesign: Redis-only mode. Enable efficient storage and faster fetching: Redis Hashes for co-locating features; string compression and hashing keys; Redis pipelines + MultiGet for effective scatter-gather (see the sketch below).
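A sketch of this layout with redis-py: one hash per entity co-locates its features, long string keys are hashed down to short ones, and a pipeline of HMGET calls does the scatter-gather in one round trip per node. Key formats, field names, and the md5 shortening are illustrative, and compression is omitted.

```python
# Sketch of the redesigned layout with redis-py: one Redis Hash per entity
# co-locates its features, and a pipeline batches many HMGET calls.
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379)


def entity_key(entity_id: str) -> str:
    # Hash long string keys down to a short fixed-size key to save memory.
    return "fs:" + hashlib.md5(entity_id.encode()).hexdigest()[:16]


# Co-locate features for one entity in a single hash.
r.hset(entity_key("user1"), mapping={
    "is_active_rider": 1,
    "has_payment_method": 1,
    "passenger_rating": 4.9,
})

# Batched read: one pipeline, one HMGET per entity, sent in a single round trip.
entities = ["user1", "user2", "user3"]
fields = ["is_active_rider", "passenger_rating"]
pipe = r.pipeline(transaction=False)
for e in entities:
    pipe.hmget(entity_key(e), fields)
results = dict(zip(entities, pipe.execute()))
```

Co-locating an entity's features in one hash means a single HMGET replaces many per-feature GETs, which is what makes offloading the scatter-gather to the server cheap.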

Acknowledgements. Thanks to my colleagues at Lyft for the feedback on this deck: Prem Santosh Udaya Shankar (Manager), Nathanael Ji, Maheep Myneni, Rohan Varshney, Abhinav Agarwal, Ravi Kiran Magham, and Seth Saperstein.