A Dist Sys Programmer's Journey into AI by Piotr Sarna

ScyllaDB, Mar 11, 2025, 22 slides

About This Presentation

This talk explores the culture shock of transitioning from distributed databases to AI. While AI operates at massive scale, distributed storage and compute remain essential. Discover key differences, unexpected parallels, and how database expertise applies in the AI world.


Slide Content

A ScyllaDB Community
A Dist Sys Programmer's
Journey into AI
Piotr Sarna
Founding Engineer

Piotr Sarna
■Wannabe goat farmer
■Distributed systems hacker
■Open-source contributor & maintainer
■Book co-author
■Database Performance at Scale
■Writing for Developers: Blogs That Get Read
(code for 50% off on manning.com: SCALE2025)

Disclaimer
I’m not an AI programmer. I barely keep up with all the
acronyms, and I’m utterly lost every time I look at
company Slack channels.

How I ended up in AI
AI I remember
machine learning folks hacking neural networks on their Jupyter
notebooks.
AI now
machine learning folks trying to squeeze the whole Internet into a few
gigabytes, while also generating orders of magnitude more synthetic
data.

How I ended up in AI

Turns out processing “the whole Internet” overlaps heavily
with distributed systems work:
●store lots of data
●serve lots of data
●process lots of data
●generate lots of data

Key takeaways from the AI world
●“tokenization” does not mean the compiler stage
●“token” is the main measurement unit for everything
●1 token == “it’s complicated, but assume ~4 bytes”
●rwkv does not stand for “read write key-value”
●“model” has 42 different meanings (depends on the context)
●“context” has 42 different meanings (depends on the context)
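The "~4 bytes per token" rule of thumb above is enough for back-of-the-envelope sizing. A minimal sketch, assuming the slide's ratio holds; the constant and both function names are made up for illustration, and real tokenizers vary by language and content:

```python
BYTES_PER_TOKEN = 4  # assumption: the slide's rule of thumb, not a tokenizer fact


def estimate_tokens(text: str) -> int:
    """Rough token count from the UTF-8 size of the text, rounded up."""
    n_bytes = len(text.encode("utf-8"))
    return -(-n_bytes // BYTES_PER_TOKEN)  # ceiling division


def estimate_bytes(tokens: int) -> int:
    """Rough storage footprint of a corpus measured in tokens."""
    return tokens * BYTES_PER_TOKEN
```

Under this assumption, a corpus of a trillion tokens comes out at roughly 4 TB of raw text, which is the kind of number that decides whether a dataset fits on one machine.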

Latency: what is it
“time delay between the cause and the effect of some
physical change in the system being observed”

L = λW
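This is Little's law. In its usual queueing-theory reading, L is the average number of requests in the system, λ is the average arrival rate, and W is the average time a request spends in the system. A small sketch of how it ties latency to throughput; the function names and the example numbers are illustrative, not from the talk:

```python
def in_flight(arrival_rate_per_s: float, latency_s: float) -> float:
    """L = lambda * W: average number of requests in the system."""
    return arrival_rate_per_s * latency_s


def max_throughput(concurrency: float, latency_s: float) -> float:
    """lambda = L / W: request rate sustainable at a fixed concurrency."""
    return concurrency / latency_s


# Hypothetical service: 1000 requests/s at 50 ms average latency
# keeps about 50 requests in flight at any moment.
requests_in_system = in_flight(1000, 0.05)
```

Read the other way round, a fixed concurrency budget means that halving latency doubles the throughput the system can sustain.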

Pekka is writing a reportedly nice book about it,
assuming he ever actually finishes:
https://www.manning.com/books/latency

Latency: does AI care?
Not that much. Inference – maybe, but users already
have Stockholm syndrome for staring at responses
generated at “human typing” speed.

Throughput: what is it
“rate of message delivery over a communication channel
in a communication network”

L = λW

“Throughput” might technically come as a sequel to the
“Latency” book (assuming it’s ever finished).

Throughput: does AI care?
Yep. Training is way more about high throughput.

Goodput?
Interestingly, duplicate/wrong/unordered data might be
welcome! In the right dose.

AI people call it “entropy.”

Scale: tokens worldwide
Wikipedia:      a few billion tokens
Stack Overflow: a few billion tokens
GitHub:         a few trillion tokens

“Natural” data
Data coming directly from the Dark Age™
-books
-academic papers
-Hacker News comment section
-source code repositories

Synthetic data
Not just made up stuff!
-Wikipedia articles,
but analyzed with grammar checkers
-code,
but along with compilation warnings, errors, outputs
-made up stuff
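The "code, but along with compilation warnings, errors, outputs" idea can be sketched as pairing each source snippet with what the toolchain says about it. The example below uses Python's built-in `compile()` as a stand-in for a real compiler; the record layout and the `annotate` name are assumptions made for illustration, not how any particular pipeline does it:

```python
def annotate(source: str) -> dict:
    """Pair a code snippet with the diagnostics produced by compiling it."""
    record = {"source": source, "ok": True, "error": None}
    try:
        # Only parses and compiles the snippet; nothing is executed.
        compile(source, "<sample>", "exec")
    except SyntaxError as exc:
        record["ok"] = False
        record["error"] = f"line {exc.lineno}: {exc.msg}"
    return record


samples = ["x = 1", "x = (1"]  # hypothetical inputs: one valid, one broken
dataset = [annotate(s) for s in samples]
```

Each record keeps the original code next to the derived signal, so the synthetic part is grounded in a real tool's output rather than made up.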

Deduplication
The Internet is quite a redundant cesspit, er, place.
Deduplication algorithms are cool, fuzzy, and distributed!
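One common family of fuzzy deduplication techniques is MinHash over word shingles; the talk does not say which algorithm is meant, so treat this as one possible illustration. A minimal single-machine sketch, with all function names and parameters chosen for the example:

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Overlapping k-word windows; near-duplicates share most of them."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, item: str) -> int:
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str, num_hashes: int = 64) -> list:
    """Signature: the minimum hash of the shingle set under each seed."""
    items = shingles(text)
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Signatures depend only on the document itself, so they can be computed independently on many workers and compared afterwards, which is where the "distributed" part comes in.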

Data lake
Systems for keeping barely structured data in one
place.

Data lakehouse
Because portmanteaux are cool (are they…)

Systems for storing barely structured data in one place,
while also allowing users to query it without losing their
minds.

Orchestrators
Systems for making sure data is processed:
●in the right order,
●at the right time,
●in a recoverable way if something bad happens.
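Two of these guarantees, ordering and recoverability, fit in a few lines. A toy sketch, assuming steps are plain Python callables and the checkpoint is an in-memory set (a real orchestrator would persist it and also handle scheduling, which is not shown); `run_pipeline` and the step names are hypothetical:

```python
from graphlib import TopologicalSorter


def run_pipeline(deps: dict, tasks: dict, done: set) -> set:
    """Run steps in dependency order, skipping those already checkpointed.

    deps maps each step to the set of steps it depends on;
    done is the checkpoint of completed steps and is updated in place.
    """
    for step in TopologicalSorter(deps).static_order():
        if step in done:
            continue  # finished in an earlier run, do not redo it
        tasks[step]()  # may raise; the checkpoint keeps what already finished
        done.add(step)
    return done


# Hypothetical three-step data pipeline.
deps = {"dedup": {"download"}, "tokenize": {"dedup"}}
log = []
tasks = {name: (lambda n=name: log.append(n)) for name in ("download", "dedup", "tokenize")}
finished = run_pipeline(deps, tasks, set())
```

If a step raises, calling `run_pipeline` again with the same `done` set resumes from the failed step instead of starting over.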

Vector databases
Systems built specifically for storing and querying
vectors.

Issue: most of them will end up implementing “normal”
SQL features, because that’s what users want.
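At its core, "querying vectors" means nearest-neighbour search. A brute-force sketch using cosine similarity, assuming small in-memory data; real vector databases use approximate indexes instead, and the names and sample rows here are made up:

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list, rows: dict, k: int = 2) -> list:
    """Keys of the k stored vectors most similar to the query."""
    ranked = sorted(rows.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [key for key, _ in ranked[:k]]


# Hypothetical embeddings keyed by document id.
rows = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
nearest = top_k([1.0, 0.0], rows, k=2)
```

The search itself is the small part; filtering, joins, and transactions around it are the "normal" database features users end up asking for.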

Databases with a few vector features
Systems built specifically for storing data.

Adding vector search features tends to be way easier
than the other way round.

Summary: there’s lots of dist sys work in AI
ping me for more useful career tips
