Hunting a Kernel Allocation Bug Triggered by io_uring by Raphael Carvalho

ScyllaDB, Oct 15, 2025

About This Presentation

The hunting saga started with a system failure in a ScyllaDB test suite, triggered by its usage of io_uring. We were initially puzzled by it, first blaming the concurrency level. But after going down the rabbit hole, we realized there was more to it. This talk will present the problem and share how...


Slide Content

A ScyllaDB Community
Raphael S. Carvalho
Principal engineer


Hunting a Kernel Allocation
Bug Triggered by io_uring

Raphael S. Carvalho

Principal Engineer at ScyllaDB
■Loves to debug hard problems
■Storage on ScyllaDB
■Loves operating systems
■Swimming in my free time

Oops! The database test failed…
"Disk error: std::system_error (error
system:12, Cannot allocate memory)"

Oops! The database test failed…
■A regression, but since when?!
■Many tests failing with ENOMEM
■Has nothing to do with disk usage
■Error indicates an I/O system call returned it
●Which one?
●io_submit()?

Time to make a hypothesis
■Cause: high concurrency
■Effect: Memory pressure
■The more concurrent tests, the less memory available for the OS
■At which point concurrency increased?

Some relevant background
■ScyllaDB based on seastar framework
■Seastar mlock()s user-space memory by default (sketch below)
●Reserves memory for OS duties
■Testing framework
●Disables mlock() to allow for high concurrency and enable swapping
●All I/O is buffered (page cache involved; optimization)
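
A minimal sketch of the memory-locking idea mentioned above, using plain mlockall() from <sys/mman.h>; this is an illustration, not Seastar's actual code:

/* Minimal sketch (not Seastar's actual code): lock all current and future
 * pages so the kernel cannot swap them out. Seastar does this by default;
 * the test framework skips it so many concurrent tests can overcommit
 * memory and rely on swap instead. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall"); /* e.g. RLIMIT_MEMLOCK too low */
        return 1;
    }
    /* ... run the application with pinned memory ... */
    return 0;
}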

Question based on background
■With swapping, memory pressure shouldn’t be a problem
■All tests will get their memory eventually
■Progress slower, but should complete
■How come a test ran out of memory then???
■Not confident in initial hypothesis
■A kernel bug? Somewhere in Linux AIO’s io_submit()?

What caused the regression?
■Regression started around December
■Log inspection showed io_uring usage.
■io_uring is not the default I/O backend choice
●Known source of instability compared to linux-aio
■A patch accidentally made some tests pick io_uring
■Moving back to linux-aio made the problem go away
●No more ENOMEM

Some thoughts on the finding
■Why does the test fail with io_uring only?
■A bug in io_uring?
■An interaction of io_uring with the file system?
■Let’s find out…

Going down the rabbit hole
■Could report to mailing list and move on with my life
■Had an itch to scratch…
■Where does ENOMEM come from?
■Which system call failed?
■Let’s trace it!

Tracing - which call failed?!
reactor-1-707139 [000] ..... 46737.358518: io_uring_submit_req: ring ..., req ..., user_data ..., opcode WRITE, flags 0x200000, sq_thread 0
...
reactor-1-707139 [000] ...1. 46737.358560: io_uring_complete: ring ..., req ..., user_data ..., result -12, cflags 0x0 extra1 0 extra2 0

Analyzing result
■Error -12 maps to ENOMEM, reported as a negative errno in the CQE result (sketch below)
■Error happens on io_uring completion path
■Error generated when processing the write
●Not allocation failure when queuing the request
■Means file system layer or below
■Puzzling!
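
For context, here is a minimal user-space sketch of how such a failure surfaces, assuming liburing is available; it is illustrative code, not ScyllaDB's or Seastar's backend. The completion entry carries a negative errno in its result field, so the result -12 in the trace is the -ENOMEM the application saw.

/* Minimal liburing sketch (assumption: liburing installed; not Seastar code).
 * Submits one buffered write and checks the completion result. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];

    memset(buf, 'x', sizeof(buf));
    int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0); /* buffered write */
    io_uring_submit(&ring);

    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res < 0) /* negative errno, e.g. -12 == -ENOMEM as in the trace */
        fprintf(stderr, "write failed: %s\n", strerror(-cqe->res));
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}

Build with -luring.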

Another finding…
■Free memory low, but available memory high
●remember: no mlock()
●Indicates pressure!
■ENOMEM escaping to user space somehow
■System should swap under pressure, yet an allocation is failing
■Syscall tracing was a good starting point, but not enough
■Better tracing tool needed…

BPF-based tracing tool: retsnoop
■retsnoop was developed exactly for investigating kernel errors
■One can:
●trace kernel functions, e.g.: ./retsnoop -e "*io_uring_enter"
●filter for certain errors, removes noise of successful calls
■Very useful for finding puzzling errors in syscalls
■We know: tests run on XFS and do buffered I/O
■We can: trace xfs_file_buffered_write and look for ENOMEM

Tracing - which function failed?!
entry_SYSCALL_64_after_hwframe+0x76
do_syscall_64+0x82
__do_sys_io_uring_enter+0x265
io_submit_sqes+0x209
io_issue_sqe+0x5b
io_write+0xdd
xfs_file_buffered_write+0x84
! 29us [-ENOMEM] xfs_file_write_iter
    iocb=&{.ki_filp=0x...,.ki_complete=0x...,.ki_flags=2359304} from=...
! 27us [-ENOMEM] xfs_file_buffered_write
    iocb=&{.ki_filp=0x...,.ki_complete=0x...,.ki_flags=2359304} from=...

Analyzing result
■Not necessarily bug in FS itself, can be another layer involved
■ki_flags=2359304 -> (IOCB_WRITE | IOCB_ALLOC_CACHE | IOCB_NOWAIT), checked below
■IOCB_NOWAIT indicates non-blocking semantics requested
●EAGAIN must be returned instead of waiting on a busy resource (a mutex, whatever)
■How come some memory allocation failed then?
●Available memory was high
●Why wasn’t EAGAIN returned instead?
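
A quick check of that decode, using flag values taken from recent kernel headers (an assumption; the exact bit positions can move between kernel versions):

/* Sanity check of the ki_flags decode above. Flag values are assumptions
 * based on recent kernel headers and may differ per kernel version. */
#include <stdio.h>

#define IOCB_NOWAIT      0x8        /* mirrors RWF_NOWAIT */
#define IOCB_WRITE       (1 << 18)
#define IOCB_ALLOC_CACHE (1 << 21)

int main(void)
{
    unsigned int ki_flags = IOCB_WRITE | IOCB_ALLOC_CACHE | IOCB_NOWAIT;
    printf("%u\n", ki_flags); /* prints 2359304, matching the trace */
    return 0;
}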

Some io_uring background
■First, io_uring submits the I/O request with non-blocking semantics (NOWAIT), known as the fast path
■On EAGAIN (i.e., it would block), it falls back to the slow path: a worker thread that can wait for the busy resource (simplified sketch below)
■Explains the presence of the NOWAIT flag in the I/O calls
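
A simplified user-space illustration of that fast-path/slow-path split; this is not the actual io_uring kernel code, and the helper functions are stand-ins that only model the control flow:

/* Simplified illustration only; NOT the io_uring source. */
#include <errno.h>
#include <stdio.h>

/* Stand-in for the non-blocking (NOWAIT) attempt; pretend the page-cache
 * folio allocation failed because blocking reclaim was not allowed. */
static int try_nowait_write(void) { return -EAGAIN; }

/* Stand-in for punting the request to a worker thread that may block. */
static void punt_to_worker(void) { puts("punted to worker (slow path)"); }

static int issue_write(void)
{
    int ret = try_nowait_write();   /* fast path */
    if (ret == -EAGAIN) {
        punt_to_worker();           /* slow path: allowed to block */
        return 0;
    }
    /* Any other error (e.g. -ENOMEM) goes straight back to user space,
     * which is exactly how this bug became visible. */
    return ret;
}

int main(void) { return issue_write() ? 1 : 0; }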

Time to make a new hypothesis…
■A kernel patch introduced a regression in some allocation path
■Now handling NOWAIT flag incorrectly
■Allowing the ENOMEM error to escape to user space
■Memory allocation docs reminded me of the GFP_NOWAIT flag
●Allocation will fail when there’s memory pressure
■Some allocation done with GFP_NOWAIT, failed under pressure
●Error (ENOMEM) not handled correctly with NOWAIT semantics

Tracing - digging a bit more

io_submit_sqes+0x209
io_issue_sqe+0x5b
io_write+0xdd
xfs_file_buffered_write+0x84
iomap_file_buffered_write+0x1a6
32us [-ENOMEM] iomap_write_begin+0x408
iter=&{.inode=0x…,.len=4096,.flags=33,.iomap={.addr=...
! 4us [-ENOMEM] iomap_get_folio
iter=&{.inode=0x…,.len=4096,.flags=33,.iomap={.addr=...

Analyzing result
■Failure happens somewhere when allocating folio (memory pressure)
■.flags=33 -> (IOMAP_WRITE | IOMAP_NOWAIT)
■Shows non-blocking semantics requested in this path
■ENOMEM is escaping the iomap layer somehow
■iomap_get_folio() performs allocation with GFP_NOWAIT
■How come iomap is not converting ENOMEM into EAGAIN???

Looking for kernel regression…
■In ‘iomap: Add async buffered write support’ (2022), Darrick predicted:

"FGP_NOWAIT can cause __filemap_get_folio to return
a NULL folio, which makes iomap_write_begin return
-ENOMEM. If nothing has been written yet, won't
that cause the ENOMEM to escape to userspace? Why
do we want that instead of EAGAIN?"

Came across ‘mm: return an ERR_PTR from __filemap_get_folio’ (2023)
struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
{
        ...
        if (iter->flags & IOMAP_NOWAIT)
                fgp |= FGP_NOWAIT;

-       folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
+       return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
                        fgp, mapping_gfp_mask(iter->inode->i_mapping));
-       if (folio)
-               return folio;
-
-       if (iter->flags & IOMAP_NOWAIT)
-               return ERR_PTR(-EAGAIN);
-       return ERR_PTR(-ENOMEM);
}

Kernel regression found!
■Realized the patch incorrectly removed the error handling
■__filemap_get_folio() should have handled it in the refactoring
■Aha! Now it all makes sense…
■In io_uring fast path, iomap fails to allocate a folio under mem pressure
■ENOMEM returned instead of EAGAIN
●Prevents io_uring from falling back to slow path
■User space thinks the system ran out of memory, despite plenty of available memory

Patching Linux to fix the regression
■Goal: Restore the proper error handling
-       if (err)
+       if (err) {
+               /*
+                * When NOWAIT I/O fails to allocate folios this could
+                * be due to a nonblocking memory allocation and not
+                * because the system actually is out of memory.
+                * Return -EAGAIN so that the caller retries in a
+                * blocking fashion instead of propagating -ENOMEM
+                * to the application.
+                */
+               if ((fgp_flags & FGP_NOWAIT) && err == -ENOMEM)
+                       err = -EAGAIN;
                return ERR_PTR(err);
+       }

Conclusion
■The OS can fail
■Blame your application first, later the OS
■You can go down the rabbit hole yourself
■Incredibly rewarding and fun experience
■You will learn a lot in your journey!

Thank you! Let’s connect.
Raphael S. Carvalho
[email protected]
@raphael_scarv