Rust + io_uring + ktls: How Fast Can We Make HTTP?

ScyllaDB · Oct 11, 2024

About This Presentation

Working on Fluke: async Rust HTTP/1 + HTTP/2 with io_uring & kTLS, sponsored by fly.io & Shopify. Unlike other Rust HTTP stacks, Fluke is built from the ground up to fully leverage io_uring, minimizing syscalls with kTLS. A promising future for proxies & apps if a stable API emerges. #Rust #io_uring #kTLS


Slide Content

A ScyllaDB Community
Rust, io_uring, ktls:
How fast can we make HTTP?
Amos Wenger
writer, video producer, cat owner
bearcove

Nobody in the Rust space is going
far enough with io_uring
(as far as I'm aware)

Amos Wenger (they/them) aka @fasterthanlime

writer, video producer, cat owner
■Wrote "Making our own executable packer"
■Teaching Rust since 2019 with Cool Bear
■Fan of TLS (thread-local storage & the other one)
bearcove

Define "HTTP"

Define "fast"

Rust HTTP is already fast

hyper on master is 📦 v1.4.1 via 🦀 v1.80.1
❯ gl --color=always | tail -5
Commit: 886551681629de812a87555bb4ecd41515e4dee6
Author: Sean McArthur <[email protected]>
Date: 2014-08-30 14:18:28 -0700 (10 years ago)

init

HTTP/1.1 200 OK
Date: Fri, 31 Dec 1999 23:59:59 GMT
Content-Type: text/html
Content-Length: 1354

/// An HTTP status code (`status-code` in RFC 9110 et al.).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct StatusCode(NonZeroU16);
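
The NonZeroU16 payload isn't just a validity guarantee: it gives Option<StatusCode> a free niche. A standalone check (simplified, illustrative version of the struct above):

use std::num::NonZeroU16;

#[derive(Clone, Copy)]
struct StatusCode(NonZeroU16);

fn main() {
    // NonZeroU16 reserves 0 as a niche, so Option<StatusCode>
    // fits in the same two bytes as StatusCode itself
    assert_eq!(std::mem::size_of::<StatusCode>(), 2);
    assert_eq!(std::mem::size_of::<Option<StatusCode>>(), 2);
}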

mystery winner > itoa stack > itoa heap > std::fmt
criterion bench: format_status_code, avg µs

// A string of packed 3-ASCII-digit status code
// values for the supported range of [100, 999]
// (900 codes, 2700 bytes).
const CODE_DIGITS: &str = "\
100101102103104105106107108109110\
✂ ✂ ✂
989990991992993994995996997998999";
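
The table makes formatting a status code pure arithmetic plus a three-byte slice. A sketch of the lookup, using the CODE_DIGITS const above (illustrative; hyper's real code differs in details):

fn status_as_str(code: u16) -> &'static str {
    debug_assert!((100..=999).contains(&code));
    let offset = (code as usize - 100) * 3;
    // each code is exactly 3 ASCII digits, so the slice
    // boundaries always land on valid UTF-8
    &CODE_DIGITS[offset..offset + 3]
}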

We're not bickering over
assembly anymore

My hypothesis

●spectre, meltdown, etc => mitigations
●mitigations => more expensive syscalls
●more expensive syscalls => io_uring (sketch below)
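
If syscalls are the tax, io_uring lets you batch them: queue many operations into a shared ring, pay for one syscall. A minimal sketch with the io-uring crate (a no-op request stands in for real reads and writes):

use io_uring::{opcode, IoUring};

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?;
    // a no-op request, standing in for reads/writes/accepts
    let nop = opcode::Nop::new().build().user_data(42);
    unsafe { ring.submission().push(&nop).expect("submission queue full") };
    // one syscall submits the batch and waits for a completion
    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("completion");
    assert_eq!(cqe.user_data(), 42);
    Ok(())
}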

Type systems are hard

fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>

Lifetimes exist in every language

Rust merely makes them explicit

fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>

blocking
fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>

fn poll_read(
&mut self,
buf: &mut [u8],
) -> Poll<Result<usize>>
evented (O_NONBLOCK)

blocking
fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>

fn poll_read(
&mut self,
cx: &mut Context<'_>,
buf: &mut [u8],
) -> Poll<Result<usize>>
evented (O_NONBLOCK)

blocking
fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>
fn poll_read(
&mut self,
cx: &mut Context<'_>,
buf: &mut ReadBuf<'_>,
) -> Poll<Result<usize>>
evented (O_NONBLOCK)

blocking
fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>
fn poll_read(
&mut self,
cx: &mut Context<'_>,
buf: &mut ReadBuf<'_>,
) -> Poll<Result<()>>
evented (O_NONBLOCK)

blocking
fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>
fn poll_read(
self: Pin<&mut Self>,
cx: &mut Context<'_>,
buf: &mut ReadBuf<'_>,
) -> Poll<Result<()>>
evented (O_NONBLOCK)

blocking
fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>
fn poll_read(
self: Pin<&mut Self>,
cx: &mut Context<'_>,
buf: &mut ReadBuf<'_>,
) -> Poll<Result<()>>
evented (O_NONBLOCK)
fn read(
&mut self,
buf: &mut [u8]
) -> Read<'_, Self>
where Self: Unpin

async

blocking
fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>
fn read(
&mut self,
buf: &mut [u8]
) -> Read<'_, Self>
where Self: Unpin
async

blocking
fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>
async fn read(
&mut self,
buf: &mut [u8],
) -> Result<usize>
where Self: Unpin { ... }
async

async fn mhh(mut s: TcpStream) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; 4];
    s.read_exact(&mut buf).await?;
    Ok(buf)
}

async stack trace

read(&mut [u8])
read_exact(&mut [u8])
mhh()

// (not shown: tokio runtime internals)

real stack trace

Read::poll(Pin<&mut Read>, &mut Context<'_>)
ReadExact::poll(Pin<&mut ReadExact>, &mut Context<'_>)
Mhh::poll(Pin<&mut Mhh>, &mut Context<'_>)

// (not shown: tokio runtime internals)
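
Those frames appear because every async fn compiles to a state machine, and its poll method is the only thing the runtime ever calls. A hand-rolled analogue (illustrative names; the inner future is assumed Unpin for simplicity):

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// stand-in for the compiler-generated `Mhh` state machine:
// the runtime never sees `mhh()`, only this `poll` frame
struct Mhh<F> {
    inner: F,
}

impl<F: Future + Unpin> Future for Mhh<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        Pin::new(&mut self.inner).poll(cx)
    }
}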

async fn mhh(mut s: TcpStream) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; 4];
    tokio::select! {
        result = s.read_exact(&mut buf) => {
            result?;
            Ok(buf)
        }
        _ = sleep(Duration::from_secs(1)) => {
            Err(timeout_err())
        }
    }
}

rio::Uring
pub fn recv<'a, Fd, Buf>(
&'a self,
stream: &'a Fd,
iov: &'a Buf
) -> Completion<'a, usize>

rio::Uring
impl<'a, C: FromCqe> Drop
for Completion<'a, C> {
    fn drop(&mut self) {
        // block until the kernel is done with the borrowed buffers
        self.wait_inner();
    }
}

let mut buf = vec![0u8; 4];
let mut read_fut = Box::pin(s.read_exact(&mut buf));

tokio::select! {
    _ = &mut read_fut => { todo!() }
    _ = sleep(Duration::from_secs(1)) => {
        // forget() skips the blocking Drop guard from safe code:
        // the kernel can still write into `buf` after it's freed
        std::mem::forget(read_fut);
        Err(timeout_err())
    }
}

tokio_uring::net::TcpStream
async fn read(&self, buf: T) -> (Result<usize>, T)
where T: BoundedBufMut;
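
That signature is the fix: the kernel owns the buffer while the read is in flight, so the API takes it by value and hands it back with the result. A hedged usage sketch (address and buffer size are placeholders; error handling trimmed):

fn main() {
    tokio_uring::start(async {
        let stream = tokio_uring::net::TcpStream::connect(
            "127.0.0.1:8080".parse().unwrap(),
        )
        .await
        .unwrap();
        let buf = vec![0u8; 4096];
        // the buffer is moved in and handed back with the result,
        // so it can't be freed while the kernel still writes to it
        let (res, buf) = stream.read(buf).await;
        let n = res.unwrap();
        println!("read {} bytes: {:?}", n, &buf[..n]);
    });
}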

Fine, I'll rewrite everything
on top of io-uring then.

docs.rs/loona

load testing is hard
■macOS = nice for dev, useless for perf
■P-states
■love your noisy neighbors
■stats are hard (coordinated omission etc.)

the plan?
■Intel(R) Xeon(R) CPU E3-1275 v6 @ 3.80GHz
■h2load from another dedicated server (example invocation below)
■16 virtual clients, max 100 streams per client
■python for automation (running commands over SSH, CSV => XLS etc.)
■perf for counting cycles, instructions, branches
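
For reference, the h2load invocation behind those numbers might look like this (hostname is a placeholder, and the 30-second duration is an assumption; -c is clients, -m is max concurrent streams per client, -D is duration):

❯ h2load -c 16 -m 100 -D 30 https://server.example/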

what's next?
●more/better benchmarks
●…on hardware from this decade
●proxying to HTTP/1.1, serving from disk
●messing with: allocators, buffer size
●io_uring: provided buffers, multishot accept/read (sketch below)
●move off of tokio entirely (no atomics needed, no "write to unpark thread" needed)
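
Multishot accept is the flavor of that io_uring item: arm one SQE and get a completion per connection, with no re-arming syscall. A hedged sketch with the io-uring crate (requires kernel 5.19+; port 0 and the single-connection loop are just for demonstration):

use io_uring::{opcode, types, IoUring};
use std::net::TcpListener;
use std::os::fd::AsRawFd;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut ring = IoUring::new(64)?;
    // one multishot SQE keeps producing CQEs, one per accepted connection
    let accept = opcode::AcceptMulti::new(types::Fd(listener.as_raw_fd()))
        .build()
        .user_data(1);
    unsafe { ring.submission().push(&accept).expect("queue full") };
    // blocks until at least one client connects
    ring.submit_and_wait(1)?;
    for cqe in ring.completion() {
        // each completion carries a fresh connection fd in result()
        println!("accepted fd {}", cqe.result());
        break; // demo: handle one completion and stop
    }
    Ok(())
}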

how do we make that happen?
●money donations
●hardware donations
●expertise donations
●did I mention money

Thank you! Let’s connect.
Amos Wenger
[email protected]
@fasterthanlime
https://fasterthanli.me
bearcove