Hunting a Kernel Allocation Bug Triggered by io_uring by Raphael Carvalho

ScyllaDB, Oct 15, 2025

About This Presentation

The hunting saga started with a system failure in a ScyllaDB test suite, triggered by its usage of io_uring. We were initially puzzled by it, first blaming the concurrency level. But after going down the rabbit hole, we realized there was more to it. This talk will present the problem and share how...


Slide Content

A ScyllaDB Community
Raphael S. Carvalho
Principal engineer


Hunting a Kernel Allocation
Bug Triggered by io_uring

Raphael S. Carvalho

Principal Engineer at ScyllaDB
■Loves to debug hard problems
■Storage on ScyllaDB
■Loves operating systems
■Swimming in my free time

Oops! The database test failed…
"Disk error: std::system_error (error
system:12, Cannot allocate memory)"

Oops! The database test failed…
■A regression, but since when?!
■Many tests failing with ENOMEM
■Has nothing to do with disk usage
■Error indicates an I/O system call returned it
●Which one?
●io_submit()?

Time to make a hypothesis
■Cause: high concurrency
■Effect: Memory pressure
■The more concurrent tests, the less memory available for the OS
■At which point concurrency increased?

Some relevant background
■ScyllaDB based on seastar framework
■Seastar mlock()s user-space memory by default (sketch below)
●Reserves memory for OS duties
■Testing framework
●Disables mlock() to allow for high concurrency and enable swapping
●All I/O is buffered (page cache involved; optimization)
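
A minimal sketch of the memory-locking idea mentioned above, using plain mlockall() from <sys/mman.h>; this is an illustration, not Seastar's actual code:

/* Minimal sketch (not Seastar's actual code): lock all current and future
 * pages so the kernel cannot swap them out. Seastar does this by default;
 * the test framework skips it so many concurrent tests can overcommit
 * memory and rely on swap instead. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall"); /* e.g. RLIMIT_MEMLOCK too low */
        return 1;
    }
    /* ... run the application with pinned memory ... */
    return 0;
}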

Question based on background
■With swapping, memory pressure shouldn’t be a problem
■All tests will get their memory eventually
■Progress slower, but should complete
■How come a test ran out of memory then???
■Not confident in initial hypothesis
■A kernel bug? Somewhere in Linux AIO’s io_submit()?

What caused the regression?
■Regression started around December
■Log inspection showed io_uring usage.
■io_uring is not the default I/O backend choice
●Known source of instability compared to linux-aio
■A patch accidentally made some tests pick io_uring
■Moving back to linux-aio made the problem go away
●No more ENOMEM

Some thoughts on the finding
■Why does the test fail with io_uring only?
■A bug in io_uring?
■An interaction of io_uring with the file system?
■Let’s find out…

Going down the rabbit hole
■Could report to mailing list and move on with my life
■Had an itch to scratch…
■Where does ENOMEM come from?
■Which system call failed?
■Let’s trace it!

Tracing - which call failed?!
reactor-1-707139 [000] ..... 46737.358518: io_uring_submit_req: ring ..., req ..., user_data ..., opcode WRITE, flags 0x200000, sq_thread 0
...
reactor-1-707139 [000] ...1. 46737.358560: io_uring_complete: ring ..., req ..., user_data ..., result -12, cflags 0x0 extra1 0 extra2 0

Analyzing result
■Error -12 maps to ENOMEM, reported as a negative errno in the CQE result (sketch below)
■Error happens on io_uring completion path
■Error generated when processing the write
●Not allocation failure when queuing the request
■Means file system layer or below
■Puzzling!
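
For context, here is a minimal user-space sketch of how such a failure surfaces, assuming liburing is available; it is illustrative code, not ScyllaDB's or Seastar's backend. The completion entry carries a negative errno in its result field, so the result -12 in the trace is the -ENOMEM the application saw.

/* Minimal liburing sketch (assumption: liburing installed; not Seastar code).
 * Submits one buffered write and checks the completion result. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];

    memset(buf, 'x', sizeof(buf));
    int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0); /* buffered write */
    io_uring_submit(&ring);

    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res < 0) /* negative errno, e.g. -12 == -ENOMEM as in the trace */
        fprintf(stderr, "write failed: %s\n", strerror(-cqe->res));
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}

Build with -luring.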

Another finding…
■Free memory low, but available memory high
●remember: no mlock()
●Indicates pressure!
■ENOMEM escaping to user space somehow
■System should swap under pressure, yet an allocation is failing
■Syscall tracing was a good starting point, but not enough
■Better tracing tool needed…

BPF-based tracing tool: retsnoop
■retsnoop was developed exactly for investigating kernel errors
■One can:
●trace kernel functions, e.g.: ./retsnoop -e "*io_uring_enter"
●filter for certain errors, removes noise of successful calls
■Very useful for finding puzzling errors in syscalls
■We know: tests run on XFS and do buffered I/O
■We can: trace xfs_file_buffered_write and look for ENOMEM

Tracing - which function failed?!
entry_SYSCALL_64_after_hwframe+0x76
do_syscall_64+0x82
__do_sys_io_uring_enter+0x265
io_submit_sqes+0x209
io_issue_sqe+0x5b
io_write+0xdd
xfs_file_buffered_write+0x84
! 29us [-ENOMEM] xfs_file_write_iter
    iocb=&{.ki_filp=0x...,.ki_complete=0x...,.ki_flags=2359304} from=...
! 27us [-ENOMEM] xfs_file_buffered_write
    iocb=&{.ki_filp=0x...,.ki_complete=0x...,.ki_flags=2359304} from=...

Analyzing result
■Not necessarily bug in FS itself, can be another layer involved
■ki_flags=2359304 -> (IOCB_WRITE | IOCB_ALLOC_CACHE | IOCB_NOWAIT), checked below
■IOCB_NOWAIT indicates non-blocking semantics requested
●EAGAIN must be returned instead of waiting on a busy resource (a mutex, whatever)
■How come some memory allocation failed then?
●Available memory was high
●Why wasn’t EAGAIN returned instead?
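
A quick check of that decode, using flag values taken from recent kernel headers (an assumption; the exact bit positions can move between kernel versions):

/* Sanity check of the ki_flags decode above. Flag values are assumptions
 * based on recent kernel headers and may differ per kernel version. */
#include <stdio.h>

#define IOCB_NOWAIT      0x8        /* mirrors RWF_NOWAIT */
#define IOCB_WRITE       (1 << 18)
#define IOCB_ALLOC_CACHE (1 << 21)

int main(void)
{
    unsigned int ki_flags = IOCB_WRITE | IOCB_ALLOC_CACHE | IOCB_NOWAIT;
    printf("%u\n", ki_flags); /* prints 2359304, matching the trace */
    return 0;
}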

Some io_uring background
■First, io_uring submits the I/O request with non-blocking semantics (NOWAIT), known as the fast path
■On EAGAIN (i.e., it would block), it falls back to the slow path: a worker thread that can wait for the busy resource (simplified sketch below)
■Explains the presence of the NOWAIT flag in the I/O calls
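
A simplified user-space illustration of that fast-path/slow-path split; this is not the actual io_uring kernel code, and the helper functions are stand-ins that only model the control flow:

/* Simplified illustration only; NOT the io_uring source. */
#include <errno.h>
#include <stdio.h>

/* Stand-in for the non-blocking (NOWAIT) attempt; pretend the page-cache
 * folio allocation failed because blocking reclaim was not allowed. */
static int try_nowait_write(void) { return -EAGAIN; }

/* Stand-in for punting the request to a worker thread that may block. */
static void punt_to_worker(void) { puts("punted to worker (slow path)"); }

static int issue_write(void)
{
    int ret = try_nowait_write();   /* fast path */
    if (ret == -EAGAIN) {
        punt_to_worker();           /* slow path: allowed to block */
        return 0;
    }
    /* Any other error (e.g. -ENOMEM) goes straight back to user space,
     * which is exactly how this bug became visible. */
    return ret;
}

int main(void) { return issue_write() ? 1 : 0; }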

Time to make a new hypothesis…
■A kernel patch introduced a regression in some allocation path
■Now handling NOWAIT flag incorrectly
■Allowing the ENOMEM error to escape to user space
■Memory allocation docs reminded me of the GFP_NOWAIT flag
●Allocation will fail when there’s memory pressure
■Some allocation done with GFP_NOWAIT, failed under pressure
●Error (ENOMEM) not handled correctly with NOWAIT semantics

Tracing - digging a bit more

io_submit_sqes+0x209
io_issue_sqe+0x5b
io_write+0xdd
xfs_file_buffered_write+0x84
iomap_file_buffered_write+0x1a6
32us [-ENOMEM] iomap_write_begin+0x408
iter=&{.inode=0x…,.len=4096,.flags=33,.iomap={.addr=...
! 4us [-ENOMEM] iomap_get_folio
iter=&{.inode=0x…,.len=4096,.flags=33,.iomap={.addr=...

Analyzing result
■Failure happens somewhere when allocating folio (memory pressure)
■.flags=33 -> (IOMAP_WRITE | IOMAP_NOWAIT)
■Shows non-blocking semantics requested in this path
■ENOMEM is escaping the iomap layer somehow
■iomap_get_folio() performs allocation with GFP_NOWAIT
■How come iomap is not converting ENOMEM into EAGAIN???

Looking for kernel regression…
■In ‘iomap: Add async buffered write support’ (2022), Darrick predicted:

"FGP_NOWAIT can cause __filemap_get_folio to return
a NULL folio, which makes iomap_write_begin return
-ENOMEM. If nothing has been written yet, won't
that cause the ENOMEM to escape to userspace? Why
do we want that instead of EAGAIN?"

Came across ‘mm: return an ERR_PTR from __filemap_get_folio’ (2023)
struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
{
        ...
        if (iter->flags & IOMAP_NOWAIT)
                fgp |= FGP_NOWAIT;

-       folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
+       return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
                        fgp, mapping_gfp_mask(iter->inode->i_mapping));
-       if (folio)
-               return folio;
-
-       if (iter->flags & IOMAP_NOWAIT)
-               return ERR_PTR(-EAGAIN);
-       return ERR_PTR(-ENOMEM);
}

Kernel regression found!
■Realized the patch incorrectly removed the error handling
■__filemap_get_folio() should have handled it in the refactoring
■Aha! Now it all makes sense…
■In io_uring fast path, iomap fails to allocate a folio under mem pressure
■ENOMEM returned instead of EAGAIN
●Prevents io_uring from falling back to slow path
■User space thinks the system ran out of memory, despite plenty of available memory

Patching Linux to fix the regression
■Goal: Restore the proper error handling
-       if (err)
+       if (err) {
+               /*
+                * When NOWAIT I/O fails to allocate folios this could
+                * be due to a nonblocking memory allocation and not
+                * because the system actually is out of memory.
+                * Return -EAGAIN so that the caller retries in a
+                * blocking fashion instead of propagating -ENOMEM
+                * to the application.
+                */
+               if ((fgp_flags & FGP_NOWAIT) && err == -ENOMEM)
+                       err = -EAGAIN;
                return ERR_PTR(err);
+       }

Conclusion
■The OS can fail
■Blame your application first, later the OS
■You can go down the rabbit hole yourself
■Incredibly rewarding and fun experience
■You will learn a lot in your journey!

Thank you! Let’s connect.
Raphael S. Carvalho
[email protected]
@raphael_scarv