From 1M to 1B Features Per Second: Scaling ShareChat's ML Feature Store

ScyllaDB 474 views 33 slides Jun 26, 2024

About This Presentation

ShareChat's Ivan Burmistrov walks through how they built a low-latency ML feature store on ScyllaDB. The first version failed to meet its scalability requirements, breaking down at a load of 1 million features per second, but it has since been scaled 1000 times to handle 1 billion features per sec...


Slide Content

From 1M to 1B Features Per Second: Scaling ShareChat’s ML Feature Store
Ivan Burmistrov, Principal Software Engineer, ShareChat
Andrei Manakov, Staff Software Engineer, ShareChat

Context

Moj, a short video app

What are the features anyway?

The story

The architecture

Tiles

Tiles

High-level architecture

Why did it fail?

ScyllaDB Schema (Bad)
CREATE TABLE features (
    entity_id text,
    tile_time timestamp,
    feature_name text,
    value blob,
    PRIMARY KEY ((entity_id), tile_time, feature_name)
);

Tiling Configuration (Bad)

Let’s do some math

Optimisations

ScyllaDB Schema (Good)
CREATE TABLE features (
    entity_id text,
    tile_time timestamp,
    features blob,
    PRIMARY KEY (entity_id, tile_time)
);
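
To make the difference between the two schemas concrete, here is a minimal Python sketch. In the first schema every (tile_time, feature_name) pair is its own row, so a read of one entity touches tiles × features rows; in the second schema all features of a tile share one row. The function names, the example numbers, and the use of JSON as the blob encoding are assumptions for illustration only; the slides do not say how the blob is serialized.

```python
import json


def rows_per_read(num_tiles: int, num_features: int, one_row_per_feature: bool) -> int:
    """Rows touched when reading all tiles of a single entity.

    one_row_per_feature=True models the schema with feature_name in the
    clustering key; False models the schema with one blob per tile.
    """
    if one_row_per_feature:
        return num_tiles * num_features
    return num_tiles


def pack_features(features: dict) -> bytes:
    """Serialize all features of one tile into a single blob.

    JSON is a stand-in encoding chosen for readability (an assumption).
    """
    return json.dumps(features, sort_keys=True).encode("utf-8")


def unpack_features(blob: bytes) -> dict:
    """Inverse of pack_features."""
    return json.loads(blob.decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical workload: 10 tiles, 20 features per tile.
    print(rows_per_read(10, 20, one_row_per_feature=True))
    print(rows_per_read(10, 20, one_row_per_feature=False))
    print(unpack_features(pack_features({"likes": 3, "views": 40})))
```

The row count per read shrinks by a factor equal to the number of features per tile, at the cost of always reading and writing a tile's features together.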

Tiling Configuration (Good)

Compaction Strategy

4x Scylla Would Be Enough?

Cache locality

Consistent hashing

Consistent hashing: ingress
nginx.ingress.kubernetes.io/upstream-hash-by: "$bucket-value"
nginx.ingress.kubernetes.io/upstream-hash-by-subset: "true"
nginx.ingress.kubernetes.io/upstream-hash-by-subset-size: "3"
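
The idea behind the annotations above can be sketched in Python: pods are grouped into subsets of three, a hash ring maps each key to one subset, and any pod in that subset may serve the request. This is a simplified model under stated assumptions, not the ingress-nginx implementation; the class and function names, the MD5-based hash, and the virtual-node count are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (MD5 is used only for determinism, not security)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str):
        """Return the node owning the first ring point after hash(key)."""
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def make_subsets(pods, size: int):
    """Group pods into consecutive subsets of `size` (the last may be smaller)."""
    return [tuple(pods[i:i + size]) for i in range(0, len(pods), size)]


if __name__ == "__main__":
    pods = [f"pod-{i}" for i in range(9)]      # hypothetical pod names
    subsets = make_subsets(pods, 3)            # subset size 3, as in the annotation
    ring = HashRing(range(len(subsets)))       # ring nodes are subset indexes
    subset = subsets[ring.lookup("user-42")]   # same key always yields the same subset
    print(subset)
```

Hashing to a subset rather than a single pod keeps a key's requests on a small, stable group of caches while still leaving room to balance load inside the group.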

Consistent hashing: ingress

Consistent hashing: subset

Consistent hashing ingress: path rewriting
nginx.ingress.kubernetes.io/use-regex: "true"
nginx.ingress.kubernetes.io/rewrite-target: /$1
nginx.ingress.kubernetes.io/upstream-hash-by: $2
hosts:
  - host: feature-service.internal
    paths:
      - path: /(method)(/hash-by-.*)
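
The effect of the regex path and the two capture groups can be illustrated with a short Python sketch: the first group becomes the path forwarded upstream (rewrite-target /$1) and the second group becomes the hash key (upstream-hash-by $2). The function name and the sample request path are hypothetical, and Python's `re` module only approximates how NGINX evaluates the location regex.

```python
import re

# Same pattern as the ingress path; "method" stands for the real method name.
INGRESS_PATH = re.compile(r"^/(method)(/hash-by-.*)")


def split_request(path: str):
    """Return (upstream_path, hash_key) for a matching path, else None."""
    match = INGRESS_PATH.match(path)
    if match is None:
        return None
    upstream_path = "/" + match.group(1)  # what rewrite-target /$1 produces
    hash_key = match.group(2)             # what upstream-hash-by $2 hashes on
    return upstream_path, hash_key


if __name__ == "__main__":
    print(split_request("/method/hash-by-user42"))
    print(split_request("/other"))
```

Carrying the hash key in the URL suffix lets the client choose the routing key, while the backend still sees only the plain method path.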

Consistent hashing: ingress

Improve cache locality: 27 deployments

Improve cache locality: 27 deployments

What’s next?

Envoy proxy

Feature Service Optimisation

Envoy proxy - result

Conclusion
Robust, proven technologies pay off (ScyllaDB, Flink, ...)
Every next step is harder than the previous one
The simplest practical solution does work
The most optimized solution isn’t human-friendly
Don’t be scared to fork a lib and adjust it for your system

Thank you! Let’s connect.
Ivan Burmistrov: burmistrov.ivan@gmail.com, @isburmistrov
Andrei Manakov: andection@gmail.com, @AndreyManakov, andection@threads