A Deep Dive into the Seastar Event Loop by Pavel Emelyanov

ScyllaDB, 22 slides, Oct 14, 2025

About This Presentation

The core and the basis of ScyllaDB's outstanding performance is the Seastar framework, and the core and the basis of Seastar is its event loop. In this presentation, we'll see what the loop does in great detail, analyze the limitations it runs under, and examine the consequences that follow from them.


Slide Content

A ScyllaDB Community
A Deep Dive into Seastar's Event Loop
Pavel Emelyanov, Engineer

Pavel Emelyanov

Engineer at ScyllaDB
■Linux containers
■ScyllaDB “storage team”
■Seastar

Agenda
■Seastar's event loop in a nutshell
■How the loop shows itself
■Limitations and their consequences

Architecture at a glance
■One thread per core
●Threads are called “shards”
●The thread-pool thread is an exception
■As little communication between threads as possible
■Keeps Linux as far away as possible
●Networking
●AIO
●Initial memory mappings
●A bit more
[Diagram of the software stack: ScyllaDB on top of Seastar on top of Linux]
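To make "as little communication between threads as possible" concrete: shards do not share state behind locks but send each other explicit messages. A minimal sketch, assuming the standard Seastar API (seastar::app_template, seastar::smp::submit_to); the application itself is hypothetical:

    #include <seastar/core/app-template.hh>
    #include <seastar/core/reactor.hh>
    #include <seastar/core/smp.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;   // starts one reactor thread ("shard") per core
        return app.run(argc, argv, [] {
            // Ask shard 1 to run a lambda; the call returns a future that
            // resolves when the remote shard has executed it.
            return seastar::smp::submit_to(1, [] {
                std::cout << "running on shard " << seastar::this_shard_id() << "\n";
            });
        });
    }

Run it with at least two shards (for example --smp 2) so that shard 1 exists.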

Main loop
■Runs everything in a loop
●Running tasks
●Kicking side activities
[Diagram: the main loop alternates between running tasks and polling]

Running tasks
■Task == lambda function
■Queued per scheduling group
■Running tasks
●Pick the sched group with the minimum vruntime
●Run its tasks until preemption is needed
[Diagram: tasks queued per scheduling group (sched groups A, B, and C)]
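As an illustration of how user code ends up in those per-group queues, here is a minimal sketch using Seastar's scheduling-group API (create_scheduling_group, with_scheduling_group); the group name "background" and the 200 shares are arbitrary values chosen for the example:

    #include <seastar/core/app-template.hh>
    #include <seastar/core/coroutine.hh>
    #include <seastar/core/scheduling.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] () -> seastar::future<> {
            // A group's shares determine how quickly its vruntime grows, and
            // the scheduler picks the runnable group with the smallest vruntime.
            seastar::scheduling_group sg =
                co_await seastar::create_scheduling_group("background", 200);
            co_await seastar::with_scheduling_group(sg, [] {
                // This lambda is queued as a task under the "background" group.
                std::cout << "hello from the background group\n";
            });
        });
    }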

Polling
■Side activities
●Dispatch AIO and pick up completions
●Poll, send and receive network
●Serve cross-shard communication
●Execution stages
●Timers
[Diagram of one loop iteration: run tasks, execute stages, submit IO, complete IO, poll SMP, flush sockets, run timers]
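Timers are a good example of such a side activity: they are serviced by the loop itself, not by a dedicated thread. A minimal sketch, assuming the standard seastar::timer API:

    #include <seastar/core/app-template.hh>
    #include <seastar/core/sleep.hh>
    #include <seastar/core/timer.hh>
    #include <chrono>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] () -> seastar::future<> {
            using namespace std::chrono_literals;
            // The callback does not run from a signal or another thread:
            // the reactor notices the expired timer while polling and
            // queues the callback as an ordinary task.
            static seastar::timer<> t([] { std::cout << "timer fired\n"; });
            t.arm(100ms);
            return seastar::sleep(200ms);   // keep the shard alive until the timer fires
        });
    }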

Leisure time
■Seastar can sleep
[Diagram: the same loop iteration (run tasks, execute stages, submit IO, complete IO, poll SMP, flush sockets, run timers); when there is nothing to do, the reactor sleeps until the next event]
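Putting the last three slides together, below is a purely conceptual sketch of one loop iteration. Every name in it (poller, run_some_tasks, sleep_until_next_event) is invented for illustration and does not correspond to Seastar's real internals:

    #include <functional>
    #include <vector>

    struct poller {
        std::function<bool()> poll;   // returns true if it found work to do
    };

    // Conceptual only: the real reactor loop is considerably more involved.
    void event_loop(std::vector<poller>& pollers,
                    const std::function<bool()>& run_some_tasks,
                    const std::function<void()>& sleep_until_next_event) {
        for (;;) {
            bool did_work = run_some_tasks();    // run queued tasks up to the task quota
            for (auto& p : pollers) {            // side activities: AIO, SMP queues,
                did_work |= p.poll();            // sockets, timers, execution stages
            }
            if (!did_work) {
                sleep_until_next_event();        // nothing to do: block until woken up
            }
        }
    }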

Reactor ways of debugging
■Linux tools
■Metrics
■Logs

Linux tools
■CPU is (almost) never idle
■strace: Lots of “unrelated” system calls
■RSS is close to 100%, so is VMEM size

Core reactor metrics
■Global and per-scheduling-group
■CPU, memory, IO, network
■Etc.
●SMP
●exceptions

CPU timing metrics
[Timeline diagram: a shard's time is split into running tasks, idling (polling), and sleeping; reactor_cpu_busy_ms covers the task-running time, reactor_awake_time_ms_total covers running plus polling, and reactor_sleep_time_ms_total covers sleeping]

CPU timing metrics (advanced)
[Timeline diagram annotating the advanced metrics: reactor_cpu_steal_time_ms corresponds to non-seastar thread runtime; scheduler_time_spent_in_task_quota_violation to the time tasks keep running after the task quota is exceeded; scheduler_starvetime_ms to the time a scheduling group stays pending between its wake-up and actually getting the CPU]

Logged events
■Stalls (CPU)
■Large allocations (memory)
■Delayed requests (IO)

Stalls? What stalls?
■task::run() can run for an arbitrarily long time
■Violating the queue length threshold
Reactor stalled for 66 ms on shard 0, in scheduling group main
Backtrace: 0x5008d9f 0x4ffff3c 0x4fff343 0x1ff1598 0x40fcf 0x17dd2c 0x18a3adc 0x18a1593 0x16b44bb 0x18e08f4 0x21f958d

Too long queue accumulated for sl:default (1029 tasks)
122: N7seastar8internal21coroutine_traits_baseINS_10shared_ptrIN2db9commitlog7segmentEE …
54: N7seastar9coroutine3allIJNS_6futureIvEES3_EE17intermediate_taskILm0EEE
80: N7seastar12continuationINS_8internal22promise_base_with_typeIvEEZZZZNS_3rpc11recv_helperI …
4: N7seastar12continuationINS_8internal22promise_base_with_typeINS_3rpc5tupleIJN5query6resultENS …
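The 66 ms in the report above is simply the threshold the stall detector was configured with. Assuming the standard Seastar reactor options, it can be tuned at startup; my_seastar_app is a placeholder name:

    ./my_seastar_app --blocked-reactor-notify-ms 25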

Task stall pitfalls
■Timer fires at its own expiration, not task-quota time
■The “stall time” is captured by a signal
■The printed call-trace is random in some sense
■It doesn’t show the continuation chain
●Always starts at reactor::run_tasks()
●Contains many inner nameless lambdas

Why are stalls bad?
■A single task or scheduling group occupies the CPU
■Other, non-CPU activity is not processed either
●Except activity that had been started before
●When it finishes, the freed resource is not re-utilized

How to avoid stalls
■Make the code preempt
●co_await maybe_yield() (see the sketch after this list)
●Remember to keep races under control
■Avoid cascading exceptions
●Propagating exceptions through co_await chains is very expensive
●2312b7a703cb9c4630c75c713458445abeb26325
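A minimal sketch of making a long CPU-bound loop preemptible with the maybe_yield awaitable mentioned above, assuming Seastar's coroutine support; process_all is a hypothetical function:

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/future.hh>
    #include <seastar/coroutine/maybe_yield.hh>
    #include <vector>

    seastar::future<> process_all(std::vector<int>& items) {
        for (auto& item : items) {
            item *= 2;   // some per-element CPU work
            // Suspends only if the task quota has been exceeded;
            // otherwise it is practically free.
            co_await seastar::coroutine::maybe_yield();
        }
    }

Every co_await is a potential preemption point, which is exactly why the slide warns about keeping races under control: other tasks may run between loop iterations.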

Preemption
■Stalls are a consequence of voluntary preemption
●The Linux kernel preempts processes/threads with a hardware timer
■Preempt by signal?
●Overhead
●Locking problems

Phantom jam
■https://www.scylladb.com/2022/04/19/exploring-phantom-jams-in-your-data-flow/

Phantom jam
■Request latency as a function of request rate

Thank you! Let’s connect.
Pavel Emelyanov
[email protected]
github: xemul