Fast and Deterministic Full Table Scans at Scale by Felipe Cardeneti Mendes

ScyllaDB 0 views 15 slides Oct 09, 2025
Slide 1
Slide 1 of 15
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15

About This Presentation

ScyllaDB’s new tablet replication algorithm replaces static vNodes with dynamic, elastic data distribution that adapts to shifting workloads. This talk discusses how tablets enable fast, predictable full table scans by keeping operations shard-local, balancing load automatically, and scaling linea...


Slide Content

A ScyllaDB Community
Fast and Deterministic Full
Table Scans at Scale
Felipe Cardeneti Mendes
Technical Director

Felipe Cardeneti Mendes (he/him)

Technical Director at ScyllaDB
■Vibe-coded the simulator for this preso
■Predictability is important
■Previous speaker on Caching :-)
■Father of a 3yo toddler (puppy)

Full Scans

Wrong Way: Single Query
■Single coordinator
■Limited parallelism
■Doesn't exploit concurrency
SELECT * FROM tbl WHERE <...>
Coordinator work

vNodes Way: Non-deterministic
Which value to pick for :X and :Y?
■Between -2^63 to +2^63 - 1
●Traverse it → done.
■Large token boundary
●Fewer requests, more data
●Higher coordination
■Smaller token boundary
●More requests, fewer data
●Less coordination

SELECT * FROM tbl WHERE token(key) >= :X AND
token(key) < :Y

vNodes Way: system.size_estimates
■Per-table estimate of replica owned vNodes
■Non-deterministic
●Prone to estimate skews
●Range size is partitionsCount * meanPartitionSize
■Used by tools like Spark and Trino
●Manual splitSize tuning
■Divides rangeSize into subranges
Token Ranges

Estimates aren't enough
■Sparse tables waste work
●Most ranges contain little or no data
> SELECT (...) FROM size_estimates WHERE keyspace_name='k' AND table_name='t'

table_name | range_start | range_end | mean_partition_size | partitions_count
-----------+----------------------+----------------------+---------------------+------------------
t | -1078580004477237357 | -966880618140703446 | 2048 | 1250
t | -1085604003861837200 | -1078580004477237357 | 32768 | 156
t | -1117853426191590572 | -1085604003861837200 | 1024 | 3200
t | -1144611754287717220 | -1117853426191590572 | 1048576 | 1000000
t | -1198247555952586603 | -1144611754287717220 | 128 | 10
(...)
■High-density tables require fine-tuning
●Prone to uneven load and contention

Tablets

Recap
A
C
B
C
A
B
■Single unit of replication in ScyllaDB
■Abstraction: Smaller table "fragments"
■Span a contiguous token range
■Dynamically shrink/expand (geometric avg size)

system.tablets – A layer of indirection
■Tablets table
○Each (table, tablet) has its own token range → (node, shard) mapping
○Mapping can change independently of node addition and removal
■A new node can be added without any owned data!
○Different tables can have different tablet counts
○Tablet counts change, tablet size on disk remains relatively constant

system.tablets
Query
Replica
Set
Token
> SELECT … FROM system.tablets WHERE …;

last_token | tablet_count | replicas
----------------------+--------------+-----------------
-9079256848778919937 | 128 | [(replica1, 0), (replica2, 0)]
-8935141660703064065 | 128 | [(replica2, 1), (replica3, 1)]
-8791026472627208193 | 128 | [(replica3, 0), (replica2, 0)]
(...)

Revisited (and Simplified) Logic
■Clients parse system.tablets (retrieve existing tablet mapping)
■Tablets spanning the same replica-shards get grouped and split together
■Workers set a routingKey for requests
SELECT * FROM tbl WHERE token(key) >= :X AND
token(key) < :Y

Implementation

Multi-Stage Queues
■Coordinator
●Retrieves actual tablet → (node, shard) mapping
●Handles splitting/balancing to a corresponding shard queue
■Scheduler
●Least Loaded Replica Balancing
■Workers
●Dispatches requests
●Handle concurrency (inflight: 2)
●Applies backpressure

Summary
■Vibe-coded simulator: github.com/fee-mendes/tablet-fullscans
■Assumes homogeneous topologies
●Mixed-shard clusters left as an exercise :)
■Mimics jitter (sleeps)
■Relatively constant time

Thank you! Let’s connect.
Felipe Mendes
[email protected]
@felipemendes.dev
scylladb.com
Tags