The Tale of Taming TigerBeetle's Tail Latency by Tobias Ziegler

ScyllaDB · 32 slides · Oct 17, 2025

About This Presentation

Learn how we reduced TigerBeetle’s tail latency through algorithm engineering. Algorithm engineering goes beyond studying theoretical complexity and considers how algorithms are executed efficiently on modern super-scalar CPUs. Specifically, we will look at Radix Sort and a k-way merge and expl...


Slide Content

A ScyllaDB Community
The Tale of Taming
TigerBeetle’s Tail
Tobias Ziegler
Software Engineer

Narrator: Tobias Ziegler (he/him)

■PhD in (Distributed) OLTP Databases
■I love performance engineering
■Sports fanatic & coffee enthusiast
Software Engineer at TigerBeetle

The Villain: TigerBeetle’s Tail

The Villain: TigerBeetle’s Tail

Throughput: 404k tx/s
Batch Latency:
■p50 = 13 ms
■p99 = 103 ms
■p100 = 104 ms

The Protagonist: TigerBeetle
High-Performance Financial Transactions Database (OLTP)
■Debit/Credit primitive instead of SQL (Oasis)
■Reliable, fast, highly available
■Replicated + persistent storage for fault tolerance

Are We I/O Bound?

I/O Is the Bottleneck of the Past
[Figure: network bandwidth and SSD bandwidth growth over time]

CPU Is the New Bottleneck

The Protagonist: TigerBeetle
Single-Core by Design
■Optimized for Updates & Writes
■Power-Law Workloads: Contention Cannot Be Parallelized
“Design is the beauty of turning constraints into advantages”

So, let us change this!
What happens here?

(Simplified) Processing Model
TXN Batch
Processing
mutable table

Algorithm Selection & Design

K-Way Merge: Naive
Merge k sorted runs into one sorted sequence.

Naïve approach:

■Scan all k runs to find the minimum element.

■Repeat until all n elements are merged.

Time complexity: O(n · k)
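As a sketch of the naive approach (in C++ purely for illustration; this is not TigerBeetle's code), every step scans all k run heads for the minimum:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive k-way merge: scan all k runs for the smallest head element on every
// step, so merging n elements costs O(n * k) comparisons.
std::vector<uint64_t> merge_naive(const std::vector<std::vector<uint64_t>>& runs) {
    std::vector<size_t> pos(runs.size(), 0);  // next unread index per run
    std::vector<uint64_t> out;
    for (;;) {
        bool found = false;
        size_t min_run = 0;
        uint64_t min_val = 0;
        for (size_t r = 0; r < runs.size(); ++r) {  // scan all k runs
            if (pos[r] < runs[r].size() &&
                (!found || runs[r][pos[r]] < min_val)) {
                found = true;
                min_run = r;
                min_val = runs[r][pos[r]];
            }
        }
        if (!found) return out;  // every run is exhausted
        out.push_back(min_val);
        ++pos[min_run];
    }
}
```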

[Figure: four sorted runs, each scanned for its head element]

Merge k sorted runs into one sorted sequence.

Tree-Based approaches:

■Select min-element in log(k).

■Repeat until all n elements are merged.

Time complexity: O(n · log(k))

K-Way Merge: Tree-Based
[Figure: four sorted runs feeding a merge tree]

Tree of Losers


Initializing:
[Figure: the tree of losers is built over four runs with head elements 1 5, 2 3, 4 7, 3 8]

Tree of Losers

[Figure: the previous winner (1) is replaced; only the nodes on its leaf-to-root path are re-compared]

Only one path competes → 1 comparison per level (vs. 2 in a min-heap). Asymptotically optimal.
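A minimal tree-of-losers sketch in C++ (illustrative only — TigerBeetle's actual implementation is not shown in the talk). Internal nodes store the loser of each match; replacing the winner re-plays only its leaf-to-root path, i.e. log2(k) comparisons per element:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

struct LoserTree {
    static constexpr uint64_t INF = std::numeric_limits<uint64_t>::max();
    std::vector<std::vector<uint64_t>> runs;
    std::vector<size_t> pos;   // next unread index per run
    std::vector<size_t> tree;  // tree[1..k-1]: loser run ids; tree[0]: winner
    size_t k;                  // number of leaves (rounded up to a power of two)

    // Head element of run r, or INF once the run is exhausted.
    uint64_t head(size_t r) const {
        return pos[r] < runs[r].size() ? runs[r][pos[r]] : INF;
    }

    explicit LoserTree(std::vector<std::vector<uint64_t>> rs) : runs(std::move(rs)) {
        k = 1;
        while (k < runs.size()) k *= 2;
        runs.resize(k);  // pad with empty (already exhausted) runs
        pos.assign(k, 0);
        tree.assign(k, 0);
        // Initialize by playing every leaf up to the root: winners move up,
        // losers stay behind in the internal nodes.
        std::vector<size_t> winner(2 * k);
        for (size_t i = 0; i < k; ++i) winner[k + i] = i;
        for (size_t n = k - 1; n >= 1; --n) {
            size_t a = winner[2 * n], b = winner[2 * n + 1];
            bool a_wins = head(a) <= head(b);
            winner[n] = a_wins ? a : b;
            tree[n] = a_wins ? b : a;
        }
        tree[0] = winner[1];
    }

    // Pop the minimum, then re-play matches only along the winner's path.
    uint64_t pop() {
        size_t w = tree[0];
        uint64_t v = head(w);
        ++pos[w];
        for (size_t n = (k + w) / 2; n >= 1; n /= 2) {
            if (head(tree[n]) < head(w)) std::swap(w, tree[n]);
        }
        tree[0] = w;
        return v;
    }
};
```

Note that each level costs exactly one comparison (against the stored loser), whereas a min-heap's sift-down compares both children at every level.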

Did We Tame The Tail?
Not Really!
Throughput: 398k tx/s
Batch Latency:
■p50 = 13 ms
■p99 = 101 ms
■max = 102 ms

Algorithm Engineering
bridges theory and practice

Methodology & Tooling
Map our abstract algorithm onto actual hardware.

■Modern CPUs are quite complex:
●Pipelined, super-scalar, out-of-order, speculative execution
We need a simple mental model to map our algorithm onto the CPU.
■Everyone has a mental model, but how accurate is it?
■To find out, we need two things:
●A representative benchmark (different key and value sizes, …)
●Tooling (perf, Polar Signals, VTune, …)

CPU Counters
CPUs have lots of performance counters:
■Instructions: machine instructions executed (“retired”)
■Cache/TLB hits and misses: for the different caches and TLBs
■Branch hits/misses: (in)correctly predicted branches
■…

CPU Counters with perf stat
> perf stat ./loser_tree

Ok, so what?

CPU Counters with PerfEvent
■A library that measures a specific region of your code
■Normalizes the CPU counters (e.g., per processed element)
■This lets you reason about the algorithm/data structure

> ./loser_tree
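The normalization idea can be sketched like this (a hypothetical `NormalizedCounters` struct for illustration — not the PerfEvent library's actual API): dividing raw counters by the number of processed elements makes runs of different sizes comparable.

```cpp
#include <cassert>
#include <cstdint>

// Per-element counter values, comparable across differently sized inputs.
struct NormalizedCounters {
    double instructions_per_elem;
    double cache_misses_per_elem;
    double branch_misses_per_elem;
};

NormalizedCounters normalize(uint64_t instructions, uint64_t cache_misses,
                             uint64_t branch_misses, uint64_t n_elements) {
    double n = static_cast<double>(n_elements);
    return {instructions / n, cache_misses / n, branch_misses / n};
}
```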

Let’s Talk Performance
This is the naive Tree of Losers that we implemented.

Optimizing for Cache Misses: Eytzinger Layout
Rule #1: Data must be ready for the CPU to enable instruction-level parallelism!


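A generic C++ sketch of the Eytzinger (BFS) layout — not TigerBeetle's code. The sorted array is stored as an implicit binary tree where node i has children 2i and 2i+1, so a search touches the same few cache lines near the root on every query, and the next index is computable without pointer chasing (the final `ffs` step is a GCC/Clang builtin):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-order fill of the implicit tree positions 1..n with the sorted values.
static size_t eytzinger_build(const std::vector<int>& sorted,
                              std::vector<int>& out, size_t i, size_t pos) {
    if (pos < out.size()) {
        i = eytzinger_build(sorted, out, i, 2 * pos);      // left subtree
        out[pos] = sorted[i++];                            // this node
        i = eytzinger_build(sorted, out, i, 2 * pos + 1);  // right subtree
    }
    return i;
}

std::vector<int> to_eytzinger(const std::vector<int>& sorted) {
    std::vector<int> out(sorted.size() + 1);  // slot 0 unused (1-indexed tree)
    eytzinger_build(sorted, out, 0, 1);
    return out;
}

// Branch-free lower-bound search over the Eytzinger layout.
size_t eytzinger_search(const std::vector<int>& tree, int key) {
    size_t i = 1;
    while (i < tree.size()) i = 2 * i + (tree[i] < key);  // descend left/right
    // Undo the trailing right-descents to recover the answer node.
    i >>= __builtin_ffsll(static_cast<long long>(~i));
    return i;  // index of the lower bound, or 0 if key > all elements
}
```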

Optimizing for TLB Misses: Huge Pages
TLB misses cost on the order of 50+ CPU cycles.
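A Linux-only sketch of asking for transparent huge pages (illustrative — TigerBeetle's actual allocation strategy may differ). With 2 MiB pages, one TLB entry covers 512 times as much memory as a 4 KiB page, so large merge buffers take far fewer TLB misses:

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Allocate an anonymous buffer and advise the kernel to back it with
// transparent huge pages; the hint degrades gracefully to 4 KiB pages.
void* alloc_huge(size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
#ifdef MADV_HUGEPAGE
    madvise(p, bytes, MADV_HUGEPAGE);  // advisory only; failure is harmless
#endif
    return p;
}
```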

Avoid Branch Mispredictions: Go Branchless
Branch Mispredictions cost around 15-20 cycles.
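Two illustrative helpers (not from the talk) showing the transformation: replace a data-dependent branch with pure data flow, so there is nothing for the predictor to get wrong:

```cpp
#include <cassert>
#include <cstdint>

// Branchy selection: on ~50/50 data, each mispredict flushes the pipeline
// at a cost of around 15-20 cycles.
uint64_t min_branchy(uint64_t a, uint64_t b) {
    if (a < b) return a;
    return b;
}

// Branchless selection: turn the comparison into an all-ones/all-zeros mask
// and select with bitwise ops -- no jump, nothing to mispredict.
uint64_t min_branchless(uint64_t a, uint64_t b) {
    uint64_t mask = -static_cast<uint64_t>(a < b);  // all ones iff a < b
    return (a & mask) | (b & ~mask);
}
```

The same idea applies to a merge tree's descent: computing the next child index arithmetically, e.g. `i = 2 * i + (right_key < left_key)`, avoids an if/else per level.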

Reduce Instructions: Unrolling and Co.
Make every instruction count!
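A generic unrolling sketch (illustrative, not TigerBeetle code): four independent accumulators cut the loop-control instructions to one branch per four elements and break the serial dependency chain, letting a super-scalar CPU retire several adds per cycle:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sum with 4x unrolling: independent accumulators expose instruction-level
// parallelism; the scalar tail handles lengths not divisible by 4.
uint64_t sum_unrolled(const uint64_t* data, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // main loop: 1 branch per 4 adds
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i) s0 += data[i];  // scalar tail
    return (s0 + s1) + (s2 + s3);
}
```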

Ablation Study and Final Result

Incremental

At Last We Tamed the Tail.

Thank you! Let’s connect.
Tobias Ziegler
[email protected]
@Tobias__Ziegler
github.com/toziegler