A Deep Dive into the Seastar Event Loop by Pavel Emelyanov
ScyllaDB
About This Presentation
The core and the basis of ScyllaDB's outstanding performance is the Seastar framework, and the core and the basis of Seastar is its event loop. In this presentation, we'll see what the loop does in great detail, analyze the limitations it runs under and all the consequences that follow from those limitations. We'll also learn how the loop can be observed by the user and the various means available to understand its behavior.
Slide Content
A ScyllaDB Community
A Deep Dive into Seastar's
Event Loop
Pavel Emelyanov
Engineer
Pavel Emelyanov
Engineer at ScyllaDB
■Linux containers
■ScyllaDB “storage team”
■Seastar
Agenda
■Seastar event loop in a nutshell
■How the loop shows itself
■Limitations and the consequences
Architecture at a glance
■One thread per core
●Threads are called “shards”
●The thread-pool thread is an exception
■As little communication between threads as possible (sketch after the diagram below)
■Keeps Linux as far away as possible
●Networking
●AIO
●Initial memory mappings
●A bit more
[Diagram: layered stack — ScyllaDB on top of Seastar on top of Linux]
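A minimal sketch (not from the slides) of what the cross-shard communication primitive looks like in Seastar code: each shard owns its data, and other shards ask for it explicitly via seastar::smp::submit_to() instead of sharing memory. The shard number and the printed message are made-up illustrations; header paths follow the current Seastar layout and may differ across versions.

#include <seastar/core/app-template.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/future.hh>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // Run a lambda on shard 1 (assumes at least two shards); the result
        // comes back as a future resolved on the calling shard.
        return seastar::smp::submit_to(1, [] {
            return unsigned(seastar::this_shard_id());
        }).then([] (unsigned shard) {
            std::cout << "hello from shard " << shard << "\n";
        });
    });
}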
Main loop
■Runs everything in a loop
●Running tasks
●Kicking side activities
[Diagram: the loop alternates between “run tasks” and “poll”]
Running tasks
■Task == lambda function
■Queued per scheduling group
■Running tasks (sketch after the diagram below)
●Pick the sched group with minimal vruntime
●Run tasks until preemption is needed
[Diagram: per-scheduling-group task queues — sched group A, sched group B, sched group C, each holding tasks]
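A simplified model (not Seastar's actual scheduler code) of “pick the group with minimal vruntime and run its tasks until preemption”. The types sched_group and need_preempt_now(), and the runnable-group filtering being omitted, are illustrative assumptions.

#include <algorithm>
#include <chrono>
#include <deque>
#include <functional>
#include <vector>

// Illustrative model only -- task == lambda, queued per scheduling group.
struct sched_group {
    std::deque<std::function<void()>> tasks;
    std::chrono::nanoseconds vruntime{0};   // CPU time this group has consumed
};

// Hypothetical preemption check; in Seastar this is the task-quota test.
bool need_preempt_now();

void run_some_tasks(std::vector<sched_group>& groups) {
    // Pick the group with the minimal vruntime...
    auto it = std::min_element(groups.begin(), groups.end(),
        [] (const sched_group& a, const sched_group& b) {
            return a.vruntime < b.vruntime;
        });
    // ...and run its queued tasks until the quota says "preempt".
    auto start = std::chrono::steady_clock::now();
    while (!it->tasks.empty() && !need_preempt_now()) {
        auto task = std::move(it->tasks.front());
        it->tasks.pop_front();
        task();
    }
    it->vruntime += std::chrono::steady_clock::now() - start;
}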
Polling
■Side activities (loop sketch after the diagram below)
●Dispatch AIO and pick up completions
●Poll, send and receive network
●Serve cross-shard communication
●Execution stages
●Timers
[Diagram: one loop iteration — run tasks, execute stages, submit IO, complete IO, poll SMP, flush sockets, run timers]
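A rough sketch of the loop shape implied by the diagram above. The poller interface, reactor_model and the helper names are illustrative assumptions, not Seastar's real classes; they only show how task running, the side activities and the sleep path fit together.

#include <memory>
#include <vector>

// Illustrative model of the loop structure only.
struct poller {
    virtual bool poll() = 0;        // returns true if it found work to do
    virtual ~poller() = default;
};

struct reactor_model {
    std::vector<std::unique_ptr<poller>> pollers; // aio, smp, sockets, timers, ...
    bool stopping = false;

    void run_some_tasks();          // as in the previous sketch
    void sleep_until_next_event();  // e.g. block in the kernel until woken

    void run() {
        while (!stopping) {
            run_some_tasks();            // run queued lambdas per sched group
            bool found_work = false;
            for (auto& p : pollers) {    // kick the side activities
                found_work |= p->poll();
            }
            if (!found_work) {           // nothing to do: the "leisure time" slide
                sleep_until_next_event();
            }
        }
    }
};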
Leisure time
■Seastar can sleep
[Diagram: the same loop iteration — run tasks, execute stages, submit IO, complete IO, poll SMP, flush sockets, run timers; when there is nothing to do, sleep until the next event]
Ways of debugging the reactor
■Linux tools
■Metrics
■Logs
Linux tools
■CPU is (almost) never idle
■strace: Lots of “unrelated” system calls
■RSS is close to 100% of the configured memory, and so is the VMEM size
Core reactor metrics
■Global and per scheduling group
■CPU, memory, IO, network
■Etc.
●SMP
●exceptions
CPU timing metrics
[Timeline diagram: running tasks → reactor_cpu_busy_ms; time awake (running tasks plus idling/polling) → reactor_awake_time_ms_total; sleeping → reactor_sleep_time_ms_total]
CPU timing metrics (advanced)
[Timeline diagram: reactor_cpu_steal_time_ms — non-seastar thread runtime; scheduler_time_spent_in_task_quota_violation — keeping running tasks after the task quota is exceeded; scheduler_starvetime_ms — time between a scheduling group's wake-up and it getting to run]
Stalls? What stalls?
■task::run() can run for an arbitrarily long time
■Violating the queue-length threshold is reported too
Reactor stalled for 66 ms on shard 0, in scheduling group main
Backtrace: 0x5008d9f 0x4ffff3c 0x4fff343 0x1ff1598 0x40fcf 0x17dd2c 0x18a3adc 0x18a1593 0x16b44bb 0x18e08f4 0x21f958d
Too long queue accumulated for sl:default (1029 tasks)
122: N7seastar8internal21coroutine_traits_baseINS_10shared_ptrIN2db9commitlog7segmentEE …
54: N7seastar9coroutine3allIJNS_6futureIvEES3_EE17intermediate_taskILm0EEE …
80: N7seastar12continuationINS_8internal22promise_base_with_typeIvEEZZZZNS_3rpc11recv_helperI …
4: N7seastar12continuationINS_8internal22promise_base_with_typeINS_3rpc5tupleIJN5query6resultENS …
Task stall pitfalls
■The timer fires at its own expiration time, not at task-quota expiration
■The “stall time” is captured by a signal
■The printed call trace is, in a sense, random
■It doesn’t show the continuation chain
●Always starts at reactor::run_tasks()
●Contains many inner nameless lambdas
Why are stalls bad?
■A single task or scheduling group occupies the CPU
■Other non-CPU activity is not processed either
●Except activity that had already been started
●When it finishes, the freed resource is not re-utilized
How to avoid stalls
■Make your code preempt
●co_await maybe_yield() (sketch after this list)
●Remember to keep races under control
■Avoid cascading exceptions
●Propagating exceptions through co_await chains is very expensive
●2312b7a703cb9c4630c75c713458445abeb26325
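A minimal sketch of the maybe_yield() pattern from the slide. The loop body and data are made up; the header and namespace (seastar::coroutine::maybe_yield in <seastar/coroutine/maybe_yield.hh>) reflect current Seastar but may differ across versions.

#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <vector>

// Hypothetical long-running work: walk a large vector without stalling the
// reactor. maybe_yield() suspends only when the task quota is exhausted,
// letting other tasks and the pollers run, then the loop resumes.
seastar::future<long> sum_all(std::vector<int> values) {
    long total = 0;
    for (int v : values) {
        total += v;
        co_await seastar::coroutine::maybe_yield();
    }
    // Note: any shared state touched across the co_await may change under
    // you -- this is the "keep races under control" bullet above.
    co_return total;
}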
Preemption
■Stalls are a consequence of voluntary (cooperative) preemption
●The Linux kernel preempts processes/threads using a hardware timer
■Preempt by signal?
●Overhead
●Locking problems
Phantom jam
■https://www.scylladb.com/2022/04/19/exploring-phantom-jams-in-your-data-flow/
Phantom jam
■Request latency as a function of request rate
Thank you! Let’s connect.
Pavel Emelyanov [email protected]
github: xemul