Efficient Deduplication in RecSys at Scale: How We Managed to Reduce Server Cost by 90% by Andrei Manakov

ScyllaDB, 24 slides, Mar 07, 2025

About This Presentation

Efficient deduplication is key to delivering relevant recommendations at scale. Andrei & Ashish will share how they optimized for 300M+ users, cutting costs by 90% while reducing latency. They cover how ScyllaDB, Apache Flink, Redpanda & Google Dataflow powered this evolution.


Slide Content

A ScyllaDB Community
Efficient Deduplication in RecSys at Scale:
How ShareChat Reduced Server Costs by 90%
Andrei Manakov
Senior Staff Software Engineer

Andrei Manakov (he/him)
■More than 13 years of industry experience
■Writes the blog “Andrei and The Optimization Stone” about distributed systems
■Harry Potter fan



Presentation Agenda
■ShareChat Overview
■What Is Deduplication in RecSys?
■Initial System State
■Optimisations
■Conclusions

What Is Deduplication In RecSys?

Initial State

Concept

Components

Optimisations

Cost Efficient Technologies

Streaming jobs

Database

Don’t Waste Your Resources

Domain Driven Optimisation

Domain Driven Optimisation

Domain Driven Optimisation

Storage Size Optimization: why is it important?

Storage Size Optimization: Compression algorithms
■PostIDs are sequential numbers. What are our options?
■Bloom Filter
  ■Offers fairly good compression
  ■Probabilistic in nature: returns false positives
  ■Impossible to recover postIDs from the filter
■Roaring Bitmaps
  ■Offers good compression only when a batch contains more than 300 postIDs
■Delta-Based Compression
  ■Better compression thanks to the sequential nature of postIDs
  ■Deterministic: can be decompressed to recover the postID set
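
To make the delta-based option concrete, here is a minimal sketch in Python. The slides do not show the actual schema, so everything here is an assumption for illustration: the function names are hypothetical, and variable-length (LEB128-style) integers are just one plausible way to store the gaps between sorted postIDs.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using 7 data bits per byte (LEB128-style)."""
    if n < 0:
        raise ValueError("varint encoding requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_post_ids(post_ids) -> bytes:
    """Sort and dedupe the IDs, then store each gap to the previous ID as a varint.

    The first ID is stored as its gap from zero.
    """
    out = bytearray()
    prev = 0
    for pid in sorted(set(post_ids)):
        out += encode_varint(pid - prev)
        prev = pid
    return bytes(out)


def decompress_post_ids(data: bytes) -> list:
    """Rebuild the sorted ID list by summing the decoded gaps."""
    ids = []
    current = 0  # running sum of gaps, i.e. the last decoded ID
    value = 0    # gap currently being decoded
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            ids.append(current)
            value = 0
            shift = 0
    return ids


if __name__ == "__main__":
    ids = [1_000_000_005, 1_000_000_001, 1_000_000_002, 1_000_000_010]
    blob = compress_post_ids(ids)
    print(len(blob), decompress_post_ids(blob))
```

Because neighbouring postIDs are close together, most gaps fit in a single byte, and unlike a Bloom filter the exact set can be recovered. The trade-off in this sketch is that a membership check requires decoding the blob.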

Optimized Compression Schema
This algorithm achieved a 60% storage reduction

Overall Cost: 30% -> 20%

Sent Post Optimization

Sent Post Optimization
- Redis cost per GB was higher than ScyllaDB's
- Sent + Viewed posts in the same DB: easier maintenance

Overall Cost: 20% -> 17%

Cloud Specific Cost Reduction
■Spot Nodes -> Cast AI
■Know your network pricing: optimize traffic between availability zones
■Redpanda -> Follower fetching
■ScyllaDB
  ■Token-aware policy
  ■Rack-aware policy
■K8s
  ■Topology-aware routing
  ■Service mesh
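
The zone-related items above share one idea: when several replicas can serve a request, prefer one in the caller's own availability zone so that traffic does not cross zones. The sketch below illustrates only that idea in plain Python. It is not the ScyllaDB driver, Redpanda, or Kubernetes API; `Replica`, `order_replicas`, and the zone names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    host: str
    zone: str  # availability zone (rack) the replica runs in


def order_replicas(replicas, local_zone):
    """Return replicas with same-zone hosts first.

    sorted() is stable, so the original order is preserved within each group.
    """
    return sorted(replicas, key=lambda r: r.zone != local_zone)


if __name__ == "__main__":
    replicas = [
        Replica("10.0.1.5", "zone-b"),
        Replica("10.0.0.7", "zone-a"),
        Replica("10.0.2.9", "zone-c"),
    ]
    for replica in order_replicas(replicas, local_zone="zone-a"):
        print(replica.host, replica.zone)
```

Remote replicas stay in the list as a fallback, so availability is kept while the common path stays inside one zone.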


Overall Cost: 17% -> 10%

Stay in Touch
Andrei Manakov
[email protected]
@AndreyManakov
https://andection.bsky.social/
https://www.linkedin.com/in/andrei-manakov-69228a81/
https://andection.substack.com/