Hunting a Kernel Allocation Bug Triggered by io_uring by Raphael Carvalho
ScyllaDB
Oct 15, 2025
About This Presentation
The hunt started with a system failure in a ScyllaDB test suite, triggered by its use of io_uring. We were initially puzzled and first blamed the concurrency level. But after going down the rabbit hole, we realized there was more to it. This talk presents the problem and shares how tracing helped us discover a bug in the Linux kernel.
Slide Content
A ScyllaDB Community
Raphael S. Carvalho
Principal engineer
Hunting a Kernel Allocation Bug Triggered by io_uring
Raphael S. Carvalho
Principal Engineer at ScyllaDB
■Loves to debug hard problems
■Storage on ScyllaDB
■Loves operating systems
■Swimming in my free time
Oops! the database test failed…
"Disk error: std::system_error (error system:12, Cannot allocate memory)"
Oops! the database test failed…
■A regression, but since when?!
■Many tests failing with ENOMEM
■Has nothing to do with disk usage
■Error indicates I/O system call returned the error
●Which one?
●io_submit() ?
Time to form a hypothesis
■Cause: high concurrency
■Effect: Memory pressure
■The more concurrent tests, the less memory available for OS
■At which point concurrency increased?
Some relevant background
■ScyllaDB is based on the Seastar framework
■Seastar mlock()s user-space memory by default
●Reserves memory for OS duties
■Testing framework
●Disables mlock() to allow for high concurrency and enable swapping
●All I/O is buffered (page cache involved; optimization)
Question based on background
■With swapping, memory pressure shouldn’t be a problem
■All tests will get their memory eventually
■Progress slower, but should complete
■How come a test ran out of memory then???
■Not confident in initial hypothesis
■A kernel bug? Somewhere in linux-aio's io_submit()?
What caused the regression?
■Regression started around December
■Log inspection showed io_uring usage.
■io_uring is not the default I/O backend choice
●Known source of instability compared to linux-aio
■A patch accidentally made some tests pick io_uring
■Moving back to linux-aio made the problem go away
●No more ENOMEM
Some thoughts on the finding
■Why does the test fail with io_uring only?
■A bug in io_uring?
■An interaction of io_uring with file system?
■Let’s find out…
Going down the rabbit hole
■Could report to mailing list and move on with my life
■Had an itch to scratch…
■Where does ENOMEM come from?
■Which system call failed?
■Let’s trace it!
Analyzing result
■Error -12 maps to ENOMEM
■Error happens on io_uring completion path
■Error generated when processing the write
●Not allocation failure when queuing the request
■Means file system layer or below
■Puzzling!
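Reading a traced return value such as -12 can be sketched in C as follows. This is a minimal illustration, not the tooling used in the talk: `errno_symbol` and `describe_ret` are hypothetical helper names, and only the two error codes discussed here are covered.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Symbolic name for a kernel-style negative return code.
 * Hypothetical helper: only the codes discussed in this talk are
 * covered, everything else is reported as "UNKNOWN". */
const char *errno_symbol(long ret)
{
    assert(ret < 0); /* only failures carry an errno */
    switch (-ret) {
    case ENOMEM: return "ENOMEM";
    case EAGAIN: return "EAGAIN";
    default:     return "UNKNOWN";
    }
}

/* Print a traced return value together with its errno description. */
void describe_ret(long ret)
{
    if (ret >= 0) {
        printf("%ld: success\n", ret);
        return;
    }
    printf("%ld: %s (%s)\n", ret, errno_symbol(ret), strerror((int)-ret));
}
```

Calling `describe_ret(-12)` on Linux with glibc should report ENOMEM with the same "Cannot allocate memory" text that appeared in the failing test's log.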
Another finding…
■Free memory low, but available memory high
●remember: no mlock()
●Indicates pressure!
■ENOMEM escaping to user space somehow
■System should swap on pressure, yet failing allocation
■Syscall tracing good starting point, but not enough
■Better tracing tool needed…
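The "free low, available high" observation can be reproduced by comparing the MemFree and MemAvailable fields of /proc/meminfo. The sketch below is only an illustration of that comparison; `meminfo_kb` is a hypothetical helper that parses meminfo-style text already read into memory, and it is not how the numbers in the talk were collected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up "<key>: <value> kB" in /proc/meminfo-style text.
 * Returns the value in kB, or -1 if the key is absent or malformed. */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    assert(klen > 0);
    while (line && *line) {
        /* A match must start a line and be followed by ':'. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;

            /* %ld skips the padding spaces before the number. */
            if (sscanf(line + klen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

A small MemFree next to a large MemAvailable means most memory is held by reclaimable pages such as the page cache, which is the situation described above: pressure, but not genuine exhaustion.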
BPF-based tracing tool: retsnoop
■retsnoop developed exactly for investigating kernel errors
■One can:
●trace kernel functions, e.g.: ./retsnoop -e "*io_uring_enter"
●filter for specific errors, which removes the noise of successful calls
■Very useful for finding puzzling errors in syscalls
■We know: tests run on XFS and do buffered I/O
■We can: trace xfs_file_buffered_write and look for ENOMEM
Analyzing result
■Not necessarily bug in FS itself, can be another layer involved
■ki_flags=2359304 (0x240008) -> (IOCB_WRITE | IOCB_ALLOC_CACHE | IOCB_NOWAIT)
■IOCB_NOWAIT indicates non-blocking semantics requested
●EAGAIN must be returned instead of waiting on busy resource (mutex, whatever)
■How come some memory allocation failed then?
●Available memory was high
●Why wasn’t EAGAIN returned instead?
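Decoding the ki_flags value by hand is error-prone, so here is a small sketch of the bitmask arithmetic. The three flag values are an assumption: they are copied by hand from include/linux/fs.h of a 6.x kernel, are kernel-internal, and may differ on other versions. `decode_ki_flags` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* ASSUMED values, copied by hand from include/linux/fs.h (6.x kernels).
 * Kernel-internal; check them against the kernel actually being traced. */
#define IOCB_NOWAIT      (1u << 3)   /* same bit as RWF_NOWAIT */
#define IOCB_WRITE       (1u << 18)
#define IOCB_ALLOC_CACHE (1u << 21)

static const struct {
    unsigned int bit;
    const char *name;
} known_flags[] = {
    { IOCB_NOWAIT,      "IOCB_NOWAIT" },
    { IOCB_WRITE,       "IOCB_WRITE" },
    { IOCB_ALLOC_CACHE, "IOCB_ALLOC_CACHE" },
};

/* Write the flag names joined by '|' into buf; bits not in the table
 * are appended as one hex number. Returns buf. */
const char *decode_ki_flags(unsigned int flags, char *buf, size_t len)
{
    size_t used = 0;

    assert(len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(known_flags) / sizeof(known_flags[0]); i++) {
        int n;

        if (!(flags & known_flags[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? "|" : "", known_flags[i].name);
        if (n < 0 || (size_t)n >= len - used)
            return buf; /* buffer too small: output is truncated */
        used += (size_t)n;
        flags &= ~known_flags[i].bit;
    }
    if (flags)
        snprintf(buf + used, len - used, "%s0x%x", used ? "|" : "", flags);
    return buf;
}
```

With these assumed values, 2359304 is exactly the OR of the three flags, and the presence of IOCB_NOWAIT is what makes the ENOMEM result surprising.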
Some io_uring background
■First, submits I/O request with non-blocking semantics (NOWAIT), known as the fast path.
■On EAGAIN (i.e. would block), falls back to the slow path using a worker thread (waits for busy resource)
■Explains the presence of NOWAIT flag in I/O calls
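The fast-path/slow-path decision can be modeled in user space as below. This is a simplified sketch of the behavior described on this slide, not io_uring's actual code: `issue_fn`, `issue_request`, and the two sample callbacks are hypothetical names, and the "worker thread" is reduced to a second, blocking call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a file-system write. nowait=true means
 * "do not block". Returns bytes written or a negative errno. */
typedef long (*issue_fn)(bool nowait);

struct issue_result {
    long ret;         /* value the application ends up seeing */
    bool used_worker; /* true if the blocking slow path ran */
};

/* Model of the dispatch: only -EAGAIN triggers the slow path. */
struct issue_result issue_request(issue_fn issue)
{
    struct issue_result r = { .ret = 0, .used_worker = false };

    assert(issue != NULL);
    r.ret = issue(true);      /* fast path: non-blocking attempt */
    if (r.ret == -EAGAIN) {
        r.used_worker = true; /* slow path: retry, allowed to block */
        r.ret = issue(false);
    }
    return r;                 /* any other error goes to user space */
}

/* Sample callback: a write that correctly reports "would block". */
long write_would_block(bool nowait)
{
    return nowait ? -EAGAIN : 4096;
}

/* Sample callback: a write that reports -ENOMEM on the non-blocking
 * attempt, as observed in the trace. */
long write_returns_enomem(bool nowait)
{
    return nowait ? -ENOMEM : 4096;
}
```

In this model `issue_request(write_would_block)` completes through the worker, while `issue_request(write_returns_enomem)` hands -ENOMEM straight back to the caller because nothing other than -EAGAIN triggers the fallback.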
Time to make a new hypothesis…
■A kernel patch introduced a regression in some allocation path
■Now handling NOWAIT flag incorrectly
■Allowing the ENOMEM error to escape to user space
■Memory allocation docs reminded me of the GFP_NOWAIT flag
●Allocation will fail when there’s memory pressure
■Some allocation done with GFP_NOWAIT, failed under pressure
●Error (ENOMEM) not handled correctly with NOWAIT semantics
Tracing - digging a bit more
…
io_submit_sqes+0x209
io_issue_sqe+0x5b
io_write+0xdd
xfs_file_buffered_write+0x84
iomap_file_buffered_write+0x1a6
32us [-ENOMEM] iomap_write_begin+0x408
iter=&{.inode=0x…,.len=4096,.flags=33,.iomap={.addr=...
! 4us [-ENOMEM] iomap_get_folio
iter=&{.inode=0x…,.len=4096,.flags=33,.iomap={.addr=...
Analyzing result
■Failure happens somewhere when allocating folio (memory pressure)
■.flags=33 (0x21) -> IOMAP_WRITE | IOMAP_NOWAIT
■Shows non-blocking semantics requested in this path
■ENOMEM is escaping iomap layer somehow
■iomap_get_folio() performs allocation with GFP_NOWAIT
■How come iomap is not converting ENOMEM into EAGAIN???
"FGP_NOWAIT can cause __filemap_get_folio to return a NULL folio, which makes iomap_write_begin return -ENOMEM. If nothing has been written yet, won't that cause the ENOMEM to escape to userspace? Why do we want that instead of EAGAIN?"
Came across ‘mm: return an ERR_PTR from __filemap_get_folio’ (2023)
struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
{
        ...
        if (iter->flags & IOMAP_NOWAIT)
                fgp |= FGP_NOWAIT;
Kernel regression found!
■Realized patch was incorrectly removing the error handling
■__filemap_get_folio() should have handled it in the refactoring
■Aha! Now it all makes sense…
■In io_uring fast path, iomap fails to allocate a folio under mem pressure
■ENOMEM returned instead of EAGAIN
●Prevents io_uring from falling back to slow path
■User space thinks system ran out of memory, despite plenty of available memory
Patching Linux to fix the regression
■Goal: Restore the proper error handling
-               if (err)
+               if (err) {
+                       /*
+                        * When NOWAIT I/O fails to allocate folios this could
+                        * be due to a nonblocking memory allocation and not
+                        * because the system actually is out of memory.
+                        * Return -EAGAIN so that the caller retries in a
+                        * blocking fashion instead of propagating -ENOMEM
+                        * to the application.
+                        */
+                       if ((fgp_flags & FGP_NOWAIT) && err == -ENOMEM)
+                               err = -EAGAIN;
                        return ERR_PTR(err);
+               }
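The rule restored by the patch can be exercised in isolation with a user-space model. This is a sketch of the error translation only, not the kernel function itself: `translate_folio_error` is a hypothetical name, and the FGP_NOWAIT value is an assumption copied by hand from include/linux/pagemap.h.

```c
#include <assert.h>
#include <errno.h>

/* ASSUMED value, copied by hand from include/linux/pagemap.h.
 * Kernel-internal; it may differ between kernel versions. */
#define FGP_NOWAIT 0x00000020u

/* Model of the restored error handling: a failed non-blocking
 * allocation is reported as "retry" instead of "out of memory". */
int translate_folio_error(unsigned int fgp_flags, int err)
{
    assert(err < 0); /* only called on the failure path */
    if ((fgp_flags & FGP_NOWAIT) && err == -ENOMEM)
        return -EAGAIN;
    return err;
}
```

Only the combination of FGP_NOWAIT and -ENOMEM is rewritten. A blocking caller that genuinely runs out of memory still sees -ENOMEM, and other errors pass through unchanged.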
Conclusion
■The OS can fail
■Blame your application first, later the OS
■You can go down the rabbit hole yourself
■Incredibly rewarding and fun experience
■You will learn a lot in your journey!
Thank you! Let’s connect.
Raphael S. Carvalho
@raphael_scarv