Reworking the Neon IO stack: Rust+tokio+io_uring+O_DIRECT by Christian Schwarz

ScyllaDB · 27 slides · Oct 08, 2025

About This Presentation

Neon is a serverless Postgres platform. Recently acquired by Databricks, the same technology now also powers Databricks Lakebase. In this talk, we will dive into Pageserver, the multi-tenant storage service at the heart of the architecture. We share techniques and lessons learned from reworking its ...


Slide Content

A ScyllaDB Community
Reworking the Neon IO stack:
Rust+tokio+io_uring+O_DIRECT
Christian Schwarz
Member of Technical Staff

Christian Schwarz

Member of Technical Staff
■ Interest in foundational systems software: databases, filesystems, OS, hypervisors
■ ZFS: zrepl; master's thesis; full time @ Nutanix
■ Fall 2022: joined Neon
■ May 2025: acquisition by Databricks ⇒ now powering Databricks Lakebase

Neon Architecture

Separation of Storage & Compute.
Compute = vanilla* Postgres inside resizable virtual machines.
*: where vanilla PG accesses the filesystem, in Neon our extension makes RPCs to Neon Storage.
This talk: evolution of caching and IO on the read path.
More: Architecture docs, CMU talk.

Pageserver
Like a filesystem: serve 8 KiB page images to PG.
Unlike a filesystem: materialize the page images from the WAL!
More: blog post.

Pageserver
Custom-built temporal key-value store.
Key = (Page, LSN)
Value = Image | WAL record
[Diagram: a WAL stream — init page #2 (a), insert tuple into page #2, init page #3 (b), insert tuple into page #3, insert tuple into page #2 (c) — plotted on a (Page, LSN) grid.]

Pageserver
Custom-built temporal key-value store.
Organizes WAL records by the 8 KiB page number they affect.
Key = (Page, LSN)
Value = Image | WAL record
[Diagram: the same WAL stream example as the previous slide.]
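The (Page, LSN) keying above can be sketched with an ordered map. This is a hypothetical, simplified model (not Pageserver's actual data structures): a `BTreeMap` keyed by `(page, lsn)` lets a range scan walk one page's "vertical" newest-first until it hits a full image.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Stand-ins for real 8 KiB page images and WAL records.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Image(&'static str),
    WalRecord(&'static str),
}

/// Collect, newest-first, all entries on `page`'s vertical at or below
/// `lsn`, stopping once a full image is found.
fn collect_for_redo(
    store: &BTreeMap<(u32, u64), Value>,
    page: u32,
    lsn: u64,
) -> Vec<Value> {
    let mut out = Vec::new();
    let range = (Bound::Included((page, 0)), Bound::Included((page, lsn)));
    for (_, v) in store.range(range).rev() {
        out.push(v.clone());
        if matches!(v, Value::Image(_)) {
            break; // the image is the base; older entries are irrelevant
        }
    }
    out
}

fn main() {
    let mut store = BTreeMap::new();
    store.insert((2, 10), Value::Image("a"));     // init page #2
    store.insert((3, 20), Value::Image("b"));     // init page #3
    store.insert((2, 30), Value::WalRecord("c")); // insert into page #2
    // Reading page #2 at LSN 30 collects [c, a]; walredo replays c onto a.
    let hits = collect_for_redo(&store, 2, 30);
    assert_eq!(hits, vec![Value::WalRecord("c"), Value::Image("a")]);
}
```

In the real system the values live in layer files on disk and in object storage; only the index structure is in DRAM, as the later slides discuss.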

Pageserver
Layer file (~SSTables): groups kv pairs; separation of keys and values; a packed btree index tells us the value offset.
Layer map: custom indexing data structure for efficient lookups (blog for details).
Persistence in object storage.
[Diagram: layer files arranged in the layer map over the (Page, LSN) space.]

GetPage@LSN
= fold(walredo, [image, rec, ..., rec])
Collect everything on a "page vertical" from the request LSN until we find an image.
Walredo uses vanilla PG wal redo functions to replay the collected records against the image.
Result: a page image as if we had replayed the vanilla WAL stream up to the requested LSN.
[Diagram: walredo applied to page #2's entries a and c.]
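The fold above can be sketched directly. This is a toy model, not the real walredo: strings stand in for 8 KiB page images and WAL records, and string concatenation stands in for Postgres's redo functions.

```rust
/// Stand-in for Postgres walredo: apply one WAL record to a page image.
fn walredo(image: String, record: &str) -> String {
    format!("{image}+{record}")
}

/// The traversal returns entries newest-first and ends with the image;
/// replay applies the records oldest-first on top of that image.
fn materialize(mut entries: Vec<&str>) -> String {
    let image = entries.pop().expect("traversal ends with an image").to_string();
    entries.iter().rev().fold(image, |img, rec| walredo(img, rec))
}

fn main() {
    // From the diagram: page #2's vertical is [c, a] (newest first),
    // so the result is image `a` with record `c` replayed on top.
    assert_eq!(materialize(vec!["c", "a"]), "a+c");
}
```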

DRAM vs IOPs
Sequential traversal is unacceptable if some % of reads need to do IO: latency ~ O(#records).
Where can we spend DRAM effectively to improve latency?
Data: makes no sense in Pageserver DRAM. The hottest data is always cached in Compute DRAM (as materialized pages, in shared_buffers)
⇒ a cache in Pageserver only benefits SQL-level latency if Compute is undersized.
Also: on a Pageserver, size(NVMe) >> size(DRAM) ⇒ can't cache all data ⇒ tail latency won't improve.
Metadata: put all of it in DRAM!
Traversal steps (1, 3) hit only in-DRAM metadata (layer map + in-layer index) and require no IO.
Data IOs (2, 4, 5) against records & full page images are issued during traversal; wait for completions only once at the end.
GetPage latency ~ O(traversal) + O(NVMe latency).
Exploit the IO parallelism offered by local NVMe storage!
[Diagram: traversal steps 1-5 over the (Page, LSN) example.]
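The "issue during traversal, wait once at the end" pattern can be sketched as follows. This is a hedged illustration, not the Pageserver code: threads stand in for async NVMe reads, and `read_record` is a hypothetical helper.

```rust
use std::thread;

/// Pretend this is an 8 KiB read at `offset` on local NVMe.
fn read_record(offset: u64) -> String {
    format!("record@{offset}")
}

fn main() {
    let offsets = [4096u64, 8192, 12288];
    // Traversal phase: the in-DRAM metadata walk discovers offsets and
    // fires off every IO immediately, without waiting on any of them...
    let in_flight: Vec<_> = offsets
        .iter()
        .map(|&off| thread::spawn(move || read_record(off)))
        .collect();
    // ...then we wait for all completions once, at the end, so total
    // latency is ~ one NVMe round trip, not #records round trips.
    let results: Vec<String> =
        in_flight.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(results, vec!["record@4096", "record@8192", "record@12288"]);
}
```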

Caching Bigger Picture
User Workloads (simplified!):
OLTP: index lookup + point query.
Compute memory: auto-scaled; LRU caching of hottest data (indices + low-cardinality relations) in shared_buffers + LFC.
Random heap page reads ⇒ GetPage request.
Latency-bound; single-digit millisecond (cf. previous slide).
OLAP-like / sequential scans: throughput >> latency.
Tune Postgres prefetch (effective_io_concurrency).
Batched traversal: amortized traversal, coalesced IO (massively benefits from cached metadata).
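The IO coalescing mentioned for batched traversal can be sketched like this. A hedged illustration (not Pageserver's actual code): adjacent 8 KiB block reads are merged into fewer, larger reads before submission.

```rust
const BLOCK: u64 = 8192; // 8 KiB page size

/// Given sorted block numbers, return (offset, len) ranges with
/// adjacent blocks merged into a single larger read.
fn coalesce(blocks: &[u64]) -> Vec<(u64, u64)> {
    let mut ranges: Vec<(u64, u64)> = Vec::new();
    for &b in blocks {
        let off = b * BLOCK;
        match ranges.last_mut() {
            // This block starts exactly where the previous range ends:
            // grow that range instead of issuing a separate IO.
            Some((start, len)) if *start + *len == off => *len += BLOCK,
            _ => ranges.push((off, BLOCK)),
        }
    }
    ranges
}

fn main() {
    // Blocks 0,1,2 merge into one 24 KiB read; block 10 stays separate.
    assert_eq!(
        coalesce(&[0, 1, 2, 10]),
        vec![(0, 3 * BLOCK), (10 * BLOCK, BLOCK)]
    );
}
```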

How We Got There

Early Days
March 2021: first commit — RocksDB scaffold; no userspace cache; tiny buffered read syscalls.
Aug 2021: first version of our custom storage format — small userspace 8 KiB block cache; on miss: 8 KiB synchronous read syscall (buffered IO) ⇒ kernel page cache is load-bearing.
March 2022: second format rewrite.
July 2022: Series A ⇒ clean up the prototype shortcuts!

Enter Async Rust
Pageserver stored all layers locally & was running out of NVMe capacity.
Solution: evict cold layers, re-download on-demand during traversal.
Async Rust + tokio make this seamless!
Shortcut: the core IO path hasn't changed — still issuing 8k synchronous read syscalls, but now from tokio executor threads!

Executor Stalls
Ramp up tenant density, enabled by the work we just did.
Symptom: clients see a throughput/latency knee, but server hardware is nowhere near fully USEd.
Explanation: we're packing more hot data ⇒ larger working set ⇒ kernel page cache hit rate plummets.
Lesson: don't take shortcuts; stay aware of what your observability doesn't see.

The Plan

The Plan
Step 1
Stick with buffered IO (too much implicit dependence).
But: use io_uring to do the buffered IO async ⇒ we no longer stall the executor.

Step 2
Get caching and IO under full userspace control using O_DIRECT.

How we use io_uring to avoid executor stalls
Constraints: no forking, no switching to another runtime, no sidecar thread/runtime. We want a pure-play solution that can be used from vanilla tokio!
Solution: tokio-epoll-uring — io_uring on top of vanilla tokio!

How we use io_uring to avoid executor stalls
Use a piece of shared state as the sqe's user_data to model the in-flight op.
Submission: io_uring_enter(..., min_complete=0, flags=0) — guaranteed to never block the caller.
But page cache hits are still served inline in the syscall, so check the completion queue on return anyway.
⇒ Page cache hit: behavior ~ synchronous read syscall.
⇒ Would block: no cqe; store the waker & return Poll::Pending.
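The two-way decision above can be modeled as follows. This is a simplified sketch, not tokio-epoll-uring's actual types: `Poll`, `InFlight`, and `poll_submission` are illustrative names, and `cqe` models whether the kernel completed the op inline.

```rust
// Illustrative stand-in for std::task::Poll.
enum Poll<T> {
    Ready(T),
    Pending,
}

// Stand-in for the shared in-flight op state referenced by user_data.
struct InFlight {
    waker_parked: bool,
}

/// After io_uring_enter returns: if a cqe is already present (page
/// cache hit), complete inline; otherwise park the waker and yield.
fn poll_submission(cqe: Option<i32>, state: &mut InFlight) -> Poll<i32> {
    match cqe {
        Some(res) => Poll::Ready(res), // behaves like a sync read syscall
        None => {
            state.waker_parked = true; // store waker, wait for the cqe
            Poll::Pending
        }
    }
}

fn main() {
    let mut s = InFlight { waker_parked: false };
    assert!(matches!(poll_submission(Some(8192), &mut s), Poll::Ready(8192)));
    assert!(!s.waker_parked);
    assert!(matches!(poll_submission(None, &mut s), Poll::Pending));
    assert!(s.waker_parked);
}
```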

Learn Rust async fundamentals here or attend a workshop by my colleague Conrad Ludgate.
[Diagram: in-flight op state at 0x55555575e260 — waker, owned buffer.]

How we use io_uring to avoid executor stalls
Completion & wake-up of blocking submissions:
The fd that represents the io_uring is epoll'able!
Hook it up with tokio's internal epoll instance, using tokio::io::unix::AsyncFd.
Spawn a background tokio task that drains the completion queue and wakes the tasks whose ops completed.
[Diagram: in-flight op state at 0x55555575e260 — waker, owned buffer.]
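What that background task does each time the io_uring fd becomes ready can be sketched like this. A simplified model, not tokio-epoll-uring's actual code: closures stand in for real `Waker`s, and `drain_completions` is an illustrative name.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Drain (user_data, result) pairs from the completion queue and wake
/// whichever task submitted each op, looked up by user_data.
fn drain_completions(
    cqes: Vec<(u64, i32)>,
    wakers: &mut HashMap<u64, Box<dyn FnMut(i32)>>,
) {
    for (user_data, res) in cqes {
        if let Some(mut wake) = wakers.remove(&user_data) {
            wake(res); // hand the result to the parked task and wake it
        }
    }
}

fn main() {
    let woken = Rc::new(RefCell::new(Vec::new()));
    let mut wakers: HashMap<u64, Box<dyn FnMut(i32)>> = HashMap::new();
    for user_data in [1u64, 2] {
        let w = Rc::clone(&woken);
        wakers.insert(
            user_data,
            Box::new(move |res| w.borrow_mut().push((user_data, res))),
        );
    }
    // Two completions arrive, possibly out of submission order.
    drain_completions(vec![(2, 8192), (1, 8192)], &mut wakers);
    assert!(wakers.is_empty());
    assert_eq!(woken.borrow().len(), 2);
}
```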

Performance
Setup: i4i.2xlarge; multi-threaded runtime; N tasks; each does random 8k reads of a 100 MiB file.
Baseline: tokio::fs.

Working Set                 | Throughput                    | Bottleneck
Fits in kernel page cache   | 5x (7x if no fairness yield)  | CPU (⇒ more efficient)
2x kernel page cache        | 97% of baseline               | Disk IOPS limit

Question: does this solve today's problem? Sorry, not much academic rigor.
Verdict: good enough, ship it!

The Plan
Step 1
Stick with buffered IO (too much implicit dependence).
But: use io_uring to do the buffered IO async ⇒ we no longer stall the executor. ✅

Step 2
Get caching and IO under full userspace control using O_DIRECT.

Why not use Buffered IO / Kernel Page Cache?
Double caching overheads between userspace cache & kernel (DRAM, memcpy).
No provisions for multi-tenancy (quotas, limits).
Poor observability, especially wrt multi-tenancy.
Dirty pages = IO debt = your malloc can now block!
No control over caching policy. Remember: all metadata in DRAM; no data in DRAM.
[Diagram: the (Page, LSN) example from earlier.]

O_DIRECT
So, we made Pageserver ready for O_DIRECT for all reads and writes.
Execution: delivered incrementally over ~1 year (details).
For metadata: continue using the 8 KiB userspace cache; sized for a 99.95% hit rate.
Data reads: always from disk; lean into the IO parallelism of the local NVMe.
Ext4 metadata: 100% kernel-cached. This lets io_uring submission issue block device IOs directly, without bouncing to a kernel worker.
Finished in May 2025!
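One practical consequence of O_DIRECT worth noting: buffer addresses, file offsets, and lengths must be aligned (to the device's logical block size; 4096 bytes is a common safe choice). A minimal sketch of allocating an alignment-safe IO buffer — the `ALIGN` value and helper name are illustrative, not Pageserver's code:

```rust
use std::alloc::{alloc, dealloc, Layout};

const ALIGN: usize = 4096; // assumed block-size alignment for O_DIRECT

/// Allocate a buffer whose address and length satisfy O_DIRECT
/// alignment requirements. Caller must dealloc with the same layout.
fn alloc_io_buf(len: usize) -> (*mut u8, Layout) {
    assert_eq!(len % ALIGN, 0, "O_DIRECT length must be block-aligned");
    let layout = Layout::from_size_align(len, ALIGN).unwrap();
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

fn main() {
    let (ptr, layout) = alloc_io_buf(8192); // one 8 KiB page, aligned
    assert_eq!(ptr as usize % ALIGN, 0);    // address is 4 KiB aligned
    unsafe { dealloc(ptr, layout) };
}
```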

Conclusion
"Cache in DRAM" vs "spend IOPS" is now under full userspace control.
tokio-epoll-uring drives all our IOs, using io_uring, from vanilla tokio!
We leverage all this control to implement a purpose-fit caching strategy.

Special thanks to Arpad Müller, Yuchen Liang, Vlad Lazar.
Future work:
- Caching granularity: full layer file index instead of 8 KiB pages?
- Multi-tenant IO scheduler & configurable QoS.
- tokio-epoll-uring: the poller task isn't great; can tokio implement this?

Thank you! Let’s connect.
Christian Schwarz
[email protected]
@problame
https://cschwarz.com