How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
ScyllaDB
Mar 12, 2025
About This Presentation
Learn about Agoda's performance tuning strategies for ScyllaDB. Worakarn shares how they optimized disk performance, fine-tuned compaction strategies, and adjusted SSTable settings to match their workload for peak efficiency.
Slide Content
A ScyllaDB Community
How Agoda Scaled 50x
Throughput with ScyllaDB
Worakarn Isaratham
Lead Software Engineer
Worakarn Isaratham (he/him)
■Lead Software Engineer, Agoda
■Based in Bangkok, Thailand
■Experience in distributed computing and software testing
■Interested in dependable software systems
Presentation Agenda
■ScyllaDB in Agoda Feature Store
■Capacity Problem
■Potential Solutions
Agoda Feature Store
Online Feature Serving
Client SDK
Cache
ScyllaDB
App Servers
3.5M EPS 1.7M EPS
200k EPS
P99 Latency: 5 ms
P99 Latency: 8 ms
Average: 5 features per entity
Growth
Since the start of 2023
■Servers traffic: 50x
Peak servers traffic, on the busiest DC
■ScyllaDB traffic: 10x
10K EPS
Peak ScyllaDB traffic, on the busiest DC
A Capacity Problem
■A new use case wanted to onboard
■Problematic usage pattern:
■Bursty traffic from a cold cache, hitting ScyllaDB at 120K EPS (12x the load at the time, 2x the load now!)
■Many duplicated requests in very quick succession
■Failed requests kept being retried
A Capacity Problem
■One DC was able to survive this load without errors.
■The other DC had many problems:
■Very high error rate
■Took 40 minutes to finish all the retries
■Metrics pointed to slow reads on ScyllaDB nodes
Slow Disks
         Bad DC              Good DC             Advantage
Disks    SATA SSD, RAID 0    NVMe SSD, RAID 0
Just Buy New Disks?
●New disks were ordered
●Improved user-side caching, which reduced this load to 7K EPS.
●How long could we survive?
Capacity
Cache-Avoiding Load Test
■Use an artificial load of single-use entities to avoid ScyllaDB caching.
25K 5K
Normal load
ScyllaDB cache
one-time-used entities
BYPASS CACHE
Flush, Restart ScyllaDB
Baseline EPS for SATA
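A minimal sketch of how the single-use entities and the BYPASS CACHE read on this slide could be generated. The table name, column names, and helper functions are hypothetical, since the slides do not show the schema or the load-test tool. BYPASS CACHE is a ScyllaDB CQL extension to SELECT; the generated entities would have to be written to the table before the read phase starts.

```python
import uuid

# Hypothetical table name; the real schema is not shown in the slides.
TABLE = "feature_store.features"


def one_time_entity_ids(n):
    """Return n fresh entity ids; each is read once, so no read hits a warm cache entry."""
    return [uuid.uuid4().hex for _ in range(n)]


def read_query(bypass_cache=True):
    """Build a parameterized read; BYPASS CACHE asks ScyllaDB not to use its row cache."""
    query = f"SELECT feature_name, value FROM {TABLE} WHERE entity_id = ?"
    if bypass_cache:
        query += " BYPASS CACHE"
    return query


ids = one_time_entity_ids(1000)
query = read_query()
```

Either technique alone should keep reads off the ScyllaDB cache; the slide lists both, along with flushing and restarting ScyllaDB between runs.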
Idea 1: Different Data Modeling
Current: one tall table
Alternative: one table per feature set
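To make the two layouts concrete, here is a rough sketch that renders CQL DDL for each. All keyspace, table, and column names, as well as the choice of primary key, are assumptions for illustration; the slides do not show the actual schema.

```python
def tall_table_ddl():
    """One tall table: every feature set shares the same table (hypothetical schema)."""
    return (
        "CREATE TABLE feature_store.features ("
        "feature_set text, entity_id text, feature_name text, value blob, "
        "PRIMARY KEY ((feature_set, entity_id), feature_name))"
    )


def per_feature_set_ddl(feature_set):
    """Alternative: a dedicated table for each feature set (hypothetical schema)."""
    return (
        f"CREATE TABLE feature_store.fs_{feature_set} ("
        "entity_id text, feature_name text, value blob, "
        "PRIMARY KEY (entity_id, feature_name))"
    )


tall = tall_table_ddl()
dedicated = per_feature_set_ddl("user_stats")
```

With the alternative layout the feature set moves out of the key and into the table name, so each feature set gets its own SSTables.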
Idea 2: Change Compaction Strategy
■Our workload is “Read-mostly, many updates”. The leveled strategy is recommended for this.
Prioritized read latency
Slow disk read
Large SSTable files
Size-tiered Compaction
Leveled Compaction
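As an illustration, the compaction strategy is a per-table CQL option. The helper below is a hypothetical sketch that renders the option clause to append to a CREATE TABLE statement; the class names are the standard strategy names, everything else is assumed.

```python
def compaction_clause(strategy_class):
    """Render the CQL table option that selects a compaction strategy."""
    return f"WITH compaction = {{'class': '{strategy_class}'}}"


size_tiered = compaction_clause("SizeTieredCompactionStrategy")
leveled = compaction_clause("LeveledCompactionStrategy")
```

The leveled variant renders as `WITH compaction = {'class': 'LeveledCompactionStrategy'}`.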
Idea 2: Change Compaction Strategy
1.5x
Idea 3: Increase Summary File Size
■ScyllaDB uses summary files to help navigate to index files
summary file size ≈ data file size × summary ratio
High ratio → Larger summary → More efficient index → Less disk I/O
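A small worked example of the formula above. The data file size and both ratios are made-up numbers for illustration, not Agoda's settings; the assumption here is that the ratio corresponds to the `sstable_summary_ratio` setting in scylla.yaml.

```python
GIB = 1024 ** 3


def summary_size_bytes(data_file_bytes, summary_ratio):
    """summary file size ≈ data file size × summary ratio"""
    return data_file_bytes * summary_ratio


data_file = 10 * GIB  # hypothetical SSTable data file size
low = summary_size_bytes(data_file, 0.0005)   # hypothetical low ratio
high = summary_size_bytes(data_file, 0.005)   # hypothetical 10x higher ratio

print(f"low ratio:  {low / 1024 ** 2:.1f} MiB of summary")
print(f"high ratio: {high / 1024 ** 2:.1f} MiB of summary")
```

Raising the ratio tenfold grows the summary tenfold. The summary is still small next to the data file, but it samples the index more densely, so a lookup scans less of the index on disk.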
Idea 3: Increase Summary File Size
4x
NVMe
60x
Rollout
Jul 2023: New summary ratio applied
Oct 2023: Migrated to NVMe disks
Leveled Compaction: only applied to new tables; existing tables need data migration
Focus shifted to other components.
Still trying out some new ideas on ScyllaDB.
Recent Experiments
●Partitioned By Feature Set, clustered by Entity
○Disastrous! 400x worse
●All features as a blob in a single row
○+35% throughput
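The blob experiment can be pictured as packing every feature of an entity into one value, so that a read fetches a single row. The sketch below uses JSON purely as a stand-in; the slides do not say which serialization format was used, and the function and feature names are hypothetical.

```python
import json


def pack_features(features):
    """Serialize all features of one entity into a single blob (JSON is an assumption)."""
    return json.dumps(features, sort_keys=True, separators=(",", ":")).encode("utf-8")


def unpack_features(blob):
    """Inverse of pack_features: decode the blob back into a feature dict."""
    return json.loads(blob.decode("utf-8"))


features = {"booking_count": 12, "avg_price": 85.5, "is_member": True}
blob = pack_features(features)
restored = unpack_features(blob)
```

The trade-off is that updating a single feature means rewriting the whole blob.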
Lessons
●Fast disks are essential!
●Benchmark your load
●Tailor your data model to fit your needs
Stay in Touch
Worakarn Isaratham
github.com/arkorwan
www.linkedin.com/in/worakarn