As Fast as Possible, But Not Faster: ScyllaDB Flow Control by Nadav Har'El


About This Presentation

Pushing requests faster than a system can handle results in rapidly growing queues. Left unchecked, this risks exhausting memory and destabilizing the system. This talk discusses how we engineered ScyllaDB’s flow control for high-volume ingestion, allowing it to throttle over-eager clients to exactly the right pace.


Slide Content

A ScyllaDB Community
As Fast As Possible, But Not Faster:
ScyllaDB Flow Control
Nadav Har’El
Distinguished Engineer

Nadav Har’El

Distinguished Engineer at ScyllaDB
■Before Scylla, I wrote a Hebrew spell-checker, a
kernel from scratch, and other eclectic stuff.
■My favorite percentile is 99.
■I’m a mathematician in theory, programmer in
practice.
■I have three children (and also a ?????? and a ??????).
October 2025

INTRODUCTION

AND GOALS OF THIS TALK

Ingestion
■We look into ingestion of data into a Scylla cluster,
i.e., a large volume of write requests.
■We would like the ingestion to proceed as fast as
possible, but not faster.

“as fast as possible, but not faster”?

■An over-eager client may send writes faster than the
cluster can complete earlier requests.
■We can absorb a short burst of requests in queues.

Allow the client to continue writing at an excessive rate → the backlog of uncompleted writes grows until memory runs out!

Goals of this talk


Understand:
■The different causes of queue buildup during writes.
■The flow control mechanisms Scylla uses to automatically cause ingestion to proceed at the optimal pace.
A simulator for experimenting with different workloads and flow control mechanisms:
https://github.com/nyh/flowsim

Dealing with an over-eager writer


■As the backlog grows, the server needs a way to tell the client
to slow down its request rate.
■The CQL protocol does not offer explicit flow-control
mechanisms for the server to slow down a client.
■Two options remain: delaying replies to the client’s
requests, and failing requests.
■Which one we can use depends on what drives the workload:

Workload model 1: Fixed concurrency



■Batch workload: Application wishes to write a large batch of
data, as fast as possible (driving the server at 100% utilization).
■A fixed number of client threads, each running a request loop:
prepare some data, make a write request, wait for the response (sketched below).
■The server can control the request rate by throttling (delaying)
its replies:
●If the server only sends N replies per second, the client will
only send N new requests per second!
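
As a concrete illustration of this model, here is a minimal, self-contained sketch (plain Python threads, with a fake synchronous write call standing in for a real CQL driver; the thread count and service time are made-up numbers): each thread has at most one request in flight, so the server controls the request rate simply by deciding when to reply.

```python
import threading, time

N_THREADS = 4         # fixed client concurrency
SERVICE_TIME = 0.01   # assumed: the "server" takes 10 ms to reply to each write

def fake_write(payload):
    # Stand-in for a synchronous CQL write: returns only when the server replies.
    time.sleep(SERVICE_TIME)

def writer_loop(n_requests):
    # The request loop from the slide: prepare data, write, wait for the reply.
    for i in range(n_requests):
        payload = {"id": i, "value": "x" * 100}   # prepare some data
        fake_write(payload)                       # blocks until the reply arrives

threads = [threading.Thread(target=writer_loop, args=(250,)) for _ in range(N_THREADS)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# With one request in flight per thread, throughput is capped by the reply rate:
# roughly N_THREADS / SERVICE_TIME writes per second, however eager the client is.
print(f"{N_THREADS * 250} writes in {elapsed:.1f}s ≈ {N_THREADS * 250 / elapsed:.0f} writes/s")
```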

Workload model 2: Unbounded concurrency





■Interactive workload: The client sends requests driven by some
external events (e.g., activity of real users).
●Request rate unrelated to completion of previous requests.
●If request rate is above cluster capacity, the server can’t slow down
these requests and the backlog grows and grows.
●To avoid that, we must fail some requests.
■To reach optimal throughput, we should fail fresh client
requests: admission control (sketched below).
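
A minimal sketch of the admission-control idea under these assumptions (the class and threshold names are made up for illustration, not Scylla's actual code): once the backlog of accepted-but-uncompleted requests crosses a limit, fresh requests are rejected immediately instead of being queued.

```python
class Overloaded(Exception):
    """Error returned to the client instead of queueing its request."""

class AdmissionController:
    def __init__(self, max_backlog=1000):
        self.max_backlog = max_backlog   # illustrative limit (could be count- or memory-based)
        self.backlog = 0                 # requests accepted but not yet completed

    def admit(self):
        # Reject *fresh* requests while overloaded; work already admitted still
        # completes, so the throughput of successful writes stays near the maximum.
        if self.backlog >= self.max_backlog:
            raise Overloaded("server overloaded, please retry later")
        self.backlog += 1

    def on_complete(self):
        self.backlog -= 1
```

In this sketch the server would call admit() when a request arrives and on_complete() when the write (including its background work) finishes.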

FLOW CONTROL OF
REGULAR WRITES

Regular writes (i.e., no materialized view)





A coordinator node receives a write and:
■Sends it to RF (e.g., 3) replicas.
■Waits for CL (e.g., 2) of those writes to complete,
■Then replies to the client: “desired consistency-level achieved”.
■The remaining (e.g., 1) writes to replicas will continue
“in the background” without the client waiting.
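
A minimal asyncio sketch of this coordinator behavior (the structure is illustrative, not Scylla's actual Seastar/C++ code, and the replica writes are simulated with sleeps): reply after CL acknowledgements and let the remaining replica writes finish in the background.

```python
import asyncio, random

async def replica_write(name):
    # Simulated write to one replica, with a random latency.
    await asyncio.sleep(random.uniform(0.001, 0.010))
    return name

async def coordinator_write(rf=3, cl=2):
    # Send the write to all RF replicas at once.
    writes = [asyncio.ensure_future(replica_write(f"replica-{i}")) for i in range(rf)]
    acks = 0
    for finished in asyncio.as_completed(writes):
        await finished
        acks += 1
        if acks == cl:
            break   # desired consistency level achieved: reply to the client now
    # The remaining RF - CL replica writes keep running "in the background";
    # neither the coordinator's reply nor the client waits for them.
    return "consistency level achieved"

async def main():
    print(await coordinator_write())
    await asyncio.sleep(0.02)   # demo only: let the background writes finish

asyncio.run(main())
```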

Why background writes cause problems







■A batch workload, upon receiving the server’s reply, will send
a new request before these background writes finish.
■If new writes come in faster than we complete background
writes, the number of these background writes can grow
without bound.

Typical cause: a slower node

■3 nodes, one slightly slower: two nodes handle 10,000 writes per second (wps) each, the third only 9,900 wps (1% slower).
■RF=3, CL=2.
■Each second:
●10,000 writes are replied to as soon as the two faster nodes respond.
●The backlog of background writes to the slowest node increases by 100.
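
A back-of-the-envelope calculation with the numbers from this example, showing why even a 1% mismatch is unsustainable without flow control:

```python
reply_rate = 10_000   # writes/s replied to once the two faster nodes respond
slow_rate = 9_900     # writes/s the slowest node can actually absorb (1% slower)

# Background writes to the slow node pile up at the difference of the two rates.
growth_per_second = reply_rate - slow_rate        # 100 background writes/s
for minutes in (1, 10, 60):
    queued = growth_per_second * 60 * minutes
    print(f"after {minutes:3d} min: {queued:,} queued background writes")
```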

Regular writes - Scylla’s solution

A simple, but effective, throttling mechanism:
■When total memory used by background writes exceeds some
limit (10% of shard’s memory), the coordinator will only reply after
all RF replica writes have completed (not just after CL).
■The backlog of background writes does not continue to grow.
■Replies are only sent at the rate we can complete all the work, so a
batch workload will slow down its requests to the same rate.
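
A sketch of this throttling decision, extending the earlier coordinator sketch (the 10% figure is from the slide; the memory accounting, names, and shard size are simplified assumptions, not Scylla's actual implementation):

```python
import asyncio

SHARD_MEMORY = 1 << 30                    # assumed: 1 GiB of memory per shard
BACKGROUND_LIMIT = SHARD_MEMORY // 10     # throttle once background writes exceed 10% of it
background_bytes = 0                      # memory currently held by background writes

async def coordinator_write(replica_writes, cl, write_size):
    global background_bytes
    tasks = [asyncio.ensure_future(w) for w in replica_writes]
    if background_bytes >= BACKGROUND_LIMIT:
        # Throttled path: the backlog is too large, so reply only after *all* RF
        # writes complete. Replies now go out at the rate we finish all the work,
        # and a fixed-concurrency batch client slows down to that same rate.
        await asyncio.gather(*tasks)
        return "ok (throttled)"
    # Normal path: reply after CL acks and account for the leftover background work.
    acks = 0
    for finished in asyncio.as_completed(tasks):
        await finished
        acks += 1
        if acks == cl:
            break
    background_bytes += write_size
    asyncio.ensure_future(_finish_background(tasks, write_size))
    return "ok"

async def _finish_background(tasks, write_size):
    global background_bytes
    await asyncio.gather(*tasks)          # wait for the remaining RF - CL replica writes
    background_bytes -= write_size        # release the memory accounted to this write
```

It can be exercised with the replica_write simulator from the earlier sketch by passing three replica_write(...) coroutines as replica_writes with cl=2.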

Simulation of “slower node” example

(Simulation plot; max backlog = 300)

FLOW CONTROL OF
MATERIALIZED VIEW WRITES

Write to a table with materialized views

■As before: coordinator sends writes to RF (e.g., 3) replicas,
waits for only the first CL (e.g., 2) of those writes to complete.
■Each replica later sends updates to the view table(s);
the client does not wait for these view updates.
●A deliberate, though often-debated, design choice.
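
A minimal asyncio sketch of this replica-side ordering (the sleeps stand in for real work; this is an illustration of the behavior described above, not Scylla's code):

```python
import asyncio

async def apply_to_base_table(write):
    await asyncio.sleep(0.001)       # stand-in for the actual base-table write

async def send_view_update(view, write):
    await asyncio.sleep(0.005)       # stand-in for the (typically slower) view update

async def replica_apply(write, views):
    # The base-table write is what the coordinator (and thus the client,
    # up to CL) actually waits for.
    await apply_to_base_table(write)
    # View updates are generated afterwards and sent in the background:
    # neither the coordinator nor the client waits for them.
    for view in views:
        asyncio.ensure_future(send_view_update(view, write))
    return "base write applied"

async def main():
    print(await replica_apply({"k": 1}, views=["view_a", "view_b"]))
    await asyncio.sleep(0.01)        # demo only: let the background view updates finish

asyncio.run(main())
```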

Why view writes cause problems

■Again the problem is background work the client doesn’t await.
■There may be a lot of such work: with V views, we typically have
2*V of these background view writes.
■If new writes come in faster than we can finish view updates,
the number of these queued view updates can grow without
bound.

Simulation (problem)

(Simulation plot showing pending view writes, pending base writes, and the request rate)

View updates - solution

■How to prevent view backlog from growing endlessly?
■Add an extra delay to each client write.
●Higher delay: slows the client, the view-update backlog declines.
●Lower delay: speeds up the client, the view-update backlog increases.
■Goal: a controller, changing delay as backlog changes, to
keep backlog in desired range.

View updates - solution

■A simple linear controller:
●delay = α * backlog
■When client concurrency is fixed, this will converge on a delay
that keeps the backlog constant:
●If delay is higher, client slows and backlog declines,
causing delay to go down.
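
A toy, self-contained simulation of this linear controller under assumed numbers (the concurrency, latencies, capacity, and α below are made up, and each base write is assumed to produce one view update; see the flowsim repository above for the real simulator): the backlog climbs until the added delay slows the client to exactly the view-update capacity, then stays put.

```python
ALPHA = 1e-4            # assumed gain: seconds of added delay per queued view update
BASE_LATENCY = 0.001    # assumed: seconds until CL replica writes complete
CONCURRENCY = 100       # fixed client concurrency (batch workload)
VIEW_CAPACITY = 50_000  # assumed: view updates the cluster can apply per second

backlog = 0.0
dt = 0.0001             # simulate in 0.1 ms steps
for _ in range(5000):
    delay = ALPHA * backlog                               # delay = alpha * backlog
    request_rate = CONCURRENCY / (BASE_LATENCY + delay)   # one request per reply, per thread
    # One view update per base write (simplification); backlog grows by the excess rate.
    backlog = max(0.0, backlog + (request_rate - VIEW_CAPACITY) * dt)

# At the fixed point the request rate equals VIEW_CAPACITY, i.e. the added delay
# converges to CONCURRENCY/VIEW_CAPACITY - BASE_LATENCY = 1 ms here, and the
# backlog settles at delay/ALPHA = 10 queued view updates.
print(f"steady-state backlog ≈ {backlog:.1f}, added delay ≈ {ALPHA * backlog * 1000:.2f} ms")
```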

Simulation (solution)

(Simulation plot showing pending base writes, pending view writes, and the request rate)

Conclusions

■Scylla flow-controls a batch workload:
●Data is ingested as fast as possible, but not faster than that.
■When the client cannot be slowed down to the optimal rate:
●Scylla starts dropping new requests, to achieve the highest
possible throughput of successful writes.
■Automatic: no need for user intervention or configuration.

Thank you! Let’s connect.
Nadav Har’El
[email protected]