How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham

ScyllaDB 279 views 23 slides Mar 12, 2025

About This Presentation

Learn about Agoda's performance tuning strategies for ScyllaDB. Worakarn shares how they optimized disk performance, fine-tuned compaction strategies, and adjusted SSTable settings to match their workload for peak efficiency.


Slide Content

A ScyllaDB Community
How Agoda Scaled 50x Throughput with ScyllaDB
Worakarn Isaratham
Lead Software Engineer

Worakarn Isaratham (he/him)
■Lead Software Engineer, Agoda
■Based in Bangkok, Thailand
■Experience in distributed computing, software testing
■Interested in dependable software systems

Presentation Agenda
■ScyllaDB in Agoda Feature Store
■Capacity Problem
■Potential Solutions

Agoda Feature Store

Online Feature Serving
Client SDK
Cache
ScyllaDB
App Servers
3.5M EPS
1.7M EPS
200k EPS
P99 Latency: 5 ms
P99 Latency: 8 ms
Average 5 features per entity

Growth
Since the start of 2023
■Servers traffic: 50x
Peak servers traffic, on the busiest DC
■ScyllaDB traffic: 10x
10K EPS
Peak ScyllaDB traffic, on the busiest DC

A Capacity Problem
■A new use case wanted to onboard
■Problematic usage pattern:
■Bursty traffic from cold cache, hitting ScyllaDB at 120K EPS (12x the load at the time, 2x the load now!)
■Many duplicated requests in very quick succession
■Kept retrying any failed requests
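
The two annotations on the 120K EPS burst imply the size of the regular ScyllaDB load. A quick back-of-the-envelope check, assuming "12x" and "2x" are exact multiples (the slide figures are probably rounded):

```python
# Back-of-the-envelope check of the burst annotations on this slide.
burst_eps = 120_000  # burst hitting ScyllaDB, from the slide

load_then = burst_eps / 12  # "12x of the load then"
load_now = burst_eps / 2    # "2x of the load now"

print(f"implied load then: {load_then:,.0f} EPS")
print(f"implied load now:  {load_now:,.0f} EPS")
```

The implied earlier load of about 10K EPS matches the 10K EPS figure on the Growth slide.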

A Capacity Problem
■One DC was able to survive this load without errors.
■The other DC ran into a lot of problems:
■Very high error rate
■Took 40 minutes to finish all the retries
■Metrics pointed to slow reads on ScyllaDB nodes

Slow Disks

                  Bad DC             Good DC            Advantage
Disks             SATA SSD, RAID 0   NVMe SSD, RAID 0
Read iops         6868               79566              11.6x
Read bandwidth    1.5G               10.1G              6.7x
Write iops        6615               41104              6.2x
Write bandwidth   1.9G               6.3G               3.3x
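
The "Advantage" column is simply the good-DC figure divided by the bad-DC figure. A small sketch that recomputes it from the table values (the dictionary keys and the `advantage` helper are names introduced here for illustration):

```python
# Figures copied from the table above: (bad DC on SATA, good DC on NVMe).
measurements = {
    "read iops": (6868, 79566),
    "read bandwidth (G)": (1.5, 10.1),
    "write iops": (6615, 41104),
    "write bandwidth (G)": (1.9, 6.3),
}

def advantage(bad: float, good: float) -> float:
    """How many times faster the good DC is, rounded to one decimal."""
    return round(good / bad, 1)

for name, (bad, good) in measurements.items():
    print(f"{name}: {advantage(bad, good)}x")
```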

Just Buy New Disks?
●New disks were ordered
●Improved user-side caching, reduced this load to 7K.
●How long could we survive?
Capacity

Cache-Avoiding Load Test
■Use artificial, one-time-used load to avoid ScyllaDB caching.
■Ways to keep the ScyllaDB cache out of the measurement: one-time-used entities, BYPASS CACHE, flush and restart ScyllaDB
■Diagram figures: 25K, 5K (labels: Normal load, Baseline EPS for SATA)
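
A minimal sketch of how such a load could be generated, assuming a hypothetical table `feature_store.features` keyed by `entity_id` (the real schema is not shown in the slides). `BYPASS CACHE` is ScyllaDB's CQL extension for SELECT statements; the helper names are introduced here for illustration, and the generated entities would have to be written to the table before the read test starts.

```python
import uuid

# Hypothetical table name, for illustration only.
TABLE = "feature_store.features"

def one_time_entity_ids(n: int) -> list[str]:
    """Generate n entity IDs that will never be requested twice."""
    return [uuid.uuid4().hex for _ in range(n)]

def read_query(bypass_cache: bool = True) -> str:
    """Build a parameterized read. BYPASS CACHE asks ScyllaDB not to
    serve the read from its cache and not to populate the cache."""
    query = f"SELECT value FROM {TABLE} WHERE entity_id = ?"
    return query + (" BYPASS CACHE" if bypass_cache else "")

ids = one_time_entity_ids(1000)
print(read_query())
```

Using each ID exactly once means every read is a cache miss even without the `BYPASS CACHE` clause, so the two techniques can be used separately or together.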

Idea 1: Different Data Modeling
Current: one tall table
Alternative: one table per feature set
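
The two layouts might look roughly like the following. This is a sketch under stated assumptions: the keyspace, table names, column names, and key choices are all hypothetical, since the slides do not show Agoda's actual schema.

```python
# Hypothetical CQL, for illustration only: every name and key choice
# below is an assumption, not Agoda's actual schema.
TALL_TABLE = """
CREATE TABLE feature_store.features (
    entity_id   text,
    feature_set text,
    feature     text,
    value       blob,
    PRIMARY KEY ((entity_id), feature_set, feature)
)"""

def per_feature_set_table(feature_set: str) -> str:
    """DDL for the alternative layout: one table per feature set."""
    if not feature_set.isidentifier():
        raise ValueError(f"not a safe table name: {feature_set!r}")
    return f"""
CREATE TABLE feature_store.{feature_set} (
    entity_id text,
    feature   text,
    value     blob,
    PRIMARY KEY ((entity_id), feature)
)"""

print(TALL_TABLE)
print(per_feature_set_table("hotel_stats"))
```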

Idea 2: Change Compaction Strategy
■Our workload is “Read-mostly, many updates”. Size-tiered strategy is recommended.
Size-tiered Compaction: Large SSTable files, Slow disk read
Leveled Compaction: Prioritized read latency
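
In CQL the compaction strategy is a table option. A minimal sketch that builds the option clause and an `ALTER TABLE` statement, assuming a hypothetical table name; the helper functions are introduced here for illustration. The Rollout slide notes that leveled compaction was only applied to a new table, so the same clause would go on a `CREATE TABLE` statement in that case.

```python
ALLOWED = {"SizeTieredCompactionStrategy", "LeveledCompactionStrategy"}

def compaction_clause(strategy: str) -> str:
    """Build the CQL table option that selects a compaction strategy."""
    if strategy not in ALLOWED:
        raise ValueError(f"unsupported strategy: {strategy}")
    return f"compaction = {{'class': '{strategy}'}}"

def alter_table(table: str, strategy: str) -> str:
    """Build an ALTER TABLE statement that switches the strategy."""
    return f"ALTER TABLE {table} WITH {compaction_clause(strategy)};"

# Hypothetical table name, for illustration only.
print(alter_table("feature_store.features", "LeveledCompactionStrategy"))
```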

Idea 2: Change Compaction Strategy
1.5x

Idea 3: Increase Summary File Size
■ScyllaDB uses summary files to help navigate to index files
summary file size ≈ data file size × summary ratio
High ratio → Larger summary → More efficient index → Less disk I/O
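
The formula above can be turned into a quick estimate of how much space the summary takes for a given ratio. The data file size and the two ratios below are illustrative values, not Agoda's settings. As far as I know, ScyllaDB exposes this ratio as the `sstable_summary_ratio` option in scylla.yaml with a default of 0.0005; treat that name and default as an assumption and check the configuration reference for your version.

```python
def summary_size_bytes(data_file_bytes: int, summary_ratio: float) -> float:
    """summary file size ≈ data file size × summary ratio."""
    if not 0 < summary_ratio < 1:
        raise ValueError("summary ratio is expected to be between 0 and 1")
    return data_file_bytes * summary_ratio

GIB = 1024 ** 3
MIB = 1024 ** 2
data_file = 100 * GIB  # hypothetical SSTable data file size

for ratio in (0.0005, 0.005):  # illustrative ratios only
    mib = summary_size_bytes(data_file, ratio) / MIB
    print(f"ratio={ratio}: summary ≈ {mib:.1f} MiB")
```

A ten-fold increase in the ratio gives a ten-fold larger summary, which is the trade made here: more space spent on summaries in exchange for fewer disk reads per index lookup.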

Idea 3: Increase Summary File Size
4x

NVMe
60x

Rollout
Jul 2023: New summary ratio applied
Oct 2023: Migrated to NVMe disks
Leveled Compaction: only applied to new table, needs data migration
Focus shifted to other components.
Still trying out some new ideas on ScyllaDB.

Recent Experiments
●Partitioned By Feature Set, clustered by Entity
○Disastrous! 400x worse
●All features as a blob in a single row
○+35% throughput
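
The two experiments could be sketched as follows. All table names, column names, and keys are hypothetical, and JSON is used for the blob only to keep the example self-contained; the slides do not say which encoding was used.

```python
import json

# Hypothetical CQL for the two experiments; all names are assumptions.
PARTITIONED_BY_FEATURE_SET = """
CREATE TABLE feature_store.by_feature_set (
    feature_set text,
    entity_id   text,
    feature     text,
    value       blob,
    PRIMARY KEY ((feature_set), entity_id, feature)
)"""

BLOB_ROW = """
CREATE TABLE feature_store.entity_blob (
    feature_set text,
    entity_id   text,
    features    blob,
    PRIMARY KEY ((feature_set, entity_id))
)"""

def pack(features: dict) -> bytes:
    """Serialize all features of one entity into a single blob value."""
    return json.dumps(features, sort_keys=True).encode("utf-8")

def unpack(blob: bytes) -> dict:
    """Inverse of pack: recover the feature dictionary from the blob."""
    return json.loads(blob.decode("utf-8"))

row = pack({"price_avg": 120.5, "rating": 4.3})
print(unpack(row))
```

One plausible reason for the gap between the two: partitioning by feature set alone puts every entity of a feature set into one very large partition, while the blob layout reads exactly one small row per lookup.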

Lessons
●Fast disks are essential!
●Benchmark your load
●Tailor your data model to fit the needs

Stay in Touch
Worakarn Isaratham
[email protected]
github.com/arkorwan
www.linkedin.com/in/worakarn