The Tale of Taming TigerBeetle's Tail Latency by Tobias Ziegler
ScyllaDB
32 slides
Oct 17, 2025
About This Presentation
Learn how we reduced TigerBeetle’s tail latency through algorithm engineering. Algorithm engineering goes beyond studying theoretical complexity and considers how algorithms are executed efficiently on modern super-scalar CPUs. Specifically, we will look at Radix Sort and a k-way merge and explore how to implement them efficiently. We then demonstrate how we apply these algorithms incrementally to avoid latency spikes in practice.
Slide Content
A ScyllaDB Community
The Tale of Taming TigerBeetle’s Tail
Tobias Ziegler
Software Engineer
Narrator: Tobias Ziegler (he/him)
■PhD in (Distributed) OLTP Databases
■I love performance engineering
■Sports fanatic & coffee enthusiast
Software Engineer at TigerBeetle
The Villain: TigerBeetle’s Tail
Throughput: 404k tx/s
Batch Latency:
■p50 = 13 ms
■p99 = 103 ms
■p100 = 104 ms
The Protagonist: TigerBeetle
High-Performance Financial Transactions Database (OLTP)
■Debit/Credit primitive instead of SQL (Oasis)
■Reliable, fast, highly available
■Replicated + persistent storage for fault tolerance
Are We I/O Bound?
I/O Is the Bottleneck of the Past
(Figure: network bandwidth and SSD bandwidth growth over time)
CPU Is the New Bottleneck
TODO: Nicer Figure
The Protagonist: TigerBeetle
Single-Core by Design
■Optimized for Updates & Writes
■Power-Law Workloads: Contention Cannot Be Parallelized
“Design is the beauty of turning
constraints into advantages”
So, let us change this!
What happens here?
(Simplified) Processing Model
(Figure: a TXN batch goes through processing into the mutable table)
Algorithm Selection & Design
K-Way Merge: Naive
Merge k sorted runs into one sorted sequence.
Naïve approach:
■Scan all k runs to find the minimum element.
■Repeat until all n elements are merged.
Time complexity: O(n · k)
(Figure: four sorted runs)
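The naive scan-all-runs merge can be sketched in a few lines. This C++ is an illustrative toy with names of my choosing; TigerBeetle itself is written in Zig, so this is not its actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Naive k-way merge: every output element requires scanning all k runs
// for the minimum head element, giving O(n * k) comparisons overall.
std::vector<int> naive_kway_merge(const std::vector<std::vector<int>>& runs) {
    std::vector<size_t> pos(runs.size(), 0);  // cursor into each run
    std::vector<int> out;
    while (true) {
        int best = -1;
        int best_val = std::numeric_limits<int>::max();
        // Scan every run head: O(k) work per emitted element.
        for (size_t i = 0; i < runs.size(); ++i) {
            if (pos[i] < runs[i].size() && runs[i][pos[i]] < best_val) {
                best_val = runs[i][pos[i]];
                best = static_cast<int>(i);
            }
        }
        if (best < 0) break;  // all runs exhausted
        out.push_back(best_val);
        ++pos[best];
    }
    return out;
}
```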
Merge k sorted runs into one sorted sequence.
Tree-Based approaches:
■Select min-element in log(k).
■Repeat until all n elements are merged.
Time complexity: O(n · log(k))
K-Way Merge: Tree-Based
(Figure: four sorted runs feeding a tournament tree)
Tree of Losers
(Figure: four runs with head elements 1, 2, 4, and 3)
Initializing:
Tree of Losers
(Figure: after initialization, the internal nodes hold the losers 2, 4, and 3; previous winner: 1)
Only one path competes → 1 comparison per level (vs. 2 in a min-heap). Asymptotically optimal.
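To make the "one comparison per level" point concrete, here is a minimal C++ sketch of a tree of losers; the structure and names are mine, not TigerBeetle's (Zig) implementation, and it uses INT_MAX as an exhausted-run sentinel:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Tree of losers: internal nodes store the loser of each match, the
// overall winner sits at tree[0]. After emitting the winner, only the
// path from its leaf to the root is replayed: log2(k) comparisons.
struct LoserTree {
    static constexpr int SENTINEL = std::numeric_limits<int>::max();
    size_t k;
    std::vector<int> tree;              // tree[0] = index of winning run
    std::vector<int> head;              // current head element of each run
    std::vector<size_t> pos;            // cursor into each run
    const std::vector<std::vector<int>>* runs;

    explicit LoserTree(const std::vector<std::vector<int>>& r)
        : k(r.size()), tree(r.size(), -1), head(r.size()),
          pos(r.size(), 0), runs(&r) {
        for (size_t i = 0; i < k; ++i)
            head[i] = r[i].empty() ? SENTINEL : r[i][0];
        for (size_t i = 0; i < k; ++i) replay(i);  // play initial matches
    }

    // Replay matches from leaf `run` up to the root, keeping losers.
    void replay(size_t run) {
        int winner = static_cast<int>(run);
        for (size_t node = (run + k) / 2; node > 0; node /= 2) {
            if (tree[node] < 0) { tree[node] = winner; return; }  // init only
            if (head[tree[node]] < head[winner])
                std::swap(winner, tree[node]);  // previous loser now wins
        }
        tree[0] = winner;
    }

    // Pop the smallest head element, or SENTINEL once all runs are drained.
    int next() {
        int w = tree[0];
        int val = head[w];
        if (val == SENTINEL) return SENTINEL;
        ++pos[w];
        head[w] = pos[w] < (*runs)[w].size() ? (*runs)[w][pos[w]] : SENTINEL;
        replay(static_cast<size_t>(w));  // one comparison per tree level
        return val;
    }
};
```

Note how `replay` touches exactly one node per level: the new head only competes against the stored losers along its own path, which is where the factor-of-two saving over a min-heap comes from.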
Did We Tame The Tail?
Not Really!
Throughput: 398k tx/s
Batch Latency:
■p50 = 13 ms
■p99 = 101 ms
■max = 102 ms
Algorithm Engineering
bridges theory and practice
Map our abstract algorithm onto actual hardware.
■Modern CPUs are quite complex:
●Pipelined, super-scalar, out-of-order, speculative execution
Need a simple mental model to map our algorithm onto the CPU.
■Everyone has a mental model, but how accurate is it?
■To find out, we need two things:
●A representative benchmark (different key and value sizes, …)
●Tooling (perf, Polar Signals, VTune, …)
Methodology & Tooling
CPU Counters
CPUs have lots of performance counters:
■Instructions: machine instructions executed (“retired”)
■Cache/TLB hits/misses: for the different caches and the TLB
■Branch hits/misses: (in)correctly predicted branches
■…
CPU Counters with perf stat
> perf stat ./loser_tree
Ok, so what?
CPU Counters with PerfEvent
■Library that can measure some part of your code
■Normalizes the CPU counters
■This allows you to reason about the algorithm/data structure
> ./loser_tree
Let’s Talk Performance
This is the naive Tree of Losers that we implemented.
TODO perf counters
Optimizing for Cache Misses: Eytzinger Layout
Rule #1: Data must be ready for the CPU to enable instruction-level parallelism!
TODO perf counters
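As an illustration of the Eytzinger (BFS) layout, here is a small C++ sketch: a sorted array is stored as an implicit binary tree (root at index 1, children of i at 2i and 2i+1), so each level of the search lands at a predictable index that can be prefetched. The function names are mine, not TigerBeetle's:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill `out` (1-indexed, slot 0 unused) with the Eytzinger layout of
// `sorted` via an in-order walk of the implicit tree.
void eytzinger(const std::vector<int>& sorted, std::vector<int>& out,
               size_t& pos, size_t i = 1) {
    if (i >= out.size()) return;
    eytzinger(sorted, out, pos, 2 * i);      // left subtree
    out[i] = sorted[pos++];                  // this node
    eytzinger(sorted, out, pos, 2 * i + 1);  // right subtree
}

std::vector<int> build_eytzinger(const std::vector<int>& sorted) {
    std::vector<int> t(sorted.size() + 1, 0);
    size_t pos = 0;
    eytzinger(sorted, t, pos);
    return t;
}

// Branch-free lower_bound: descend by pure index arithmetic, then undo
// the trailing right-turns to recover the answer. Returns the Eytzinger
// index of the first element >= key, or 0 if no such element exists.
size_t eytzinger_lower_bound(const std::vector<int>& t, int key) {
    size_t i = 1;
    while (i < t.size())
        i = 2 * i + (t[i] < key);  // no data-dependent branch
    i >>= __builtin_ffsll(static_cast<long long>(~i));
    return i;
}
```

Because level j of the tree occupies indices [2^j, 2^(j+1)), the top levels all share a few cache lines, which is exactly what a repeated tree descent wants.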
Optimizing for TLB Misses: Huge Pages
TLB Misses cost in the order of ~50+ CPU cycles
TODO perf counters
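One way to get huge pages on Linux is transparent huge pages via `madvise(MADV_HUGEPAGE)`; a 2 MiB page covers 512 4 KiB pages, so one TLB entry maps far more of a large merge buffer. A hedged sketch (Linux-specific, best-effort, and not TigerBeetle's actual allocation code; real code would also align the mapping to the huge-page size):

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstring>

// Allocate an anonymous mapping rounded up to a 2 MiB multiple and hint
// the kernel to back it with transparent huge pages. The hint is
// best-effort: if THP is unavailable, the mapping still works with
// regular 4 KiB pages, so the madvise error is deliberately ignored.
void* alloc_huge(size_t bytes) {
    const size_t huge = 2 * 1024 * 1024;            // typical x86-64 huge page
    size_t len = (bytes + huge - 1) / huge * huge;  // round up
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, len, MADV_HUGEPAGE);  // hint only; errors are non-fatal
    return p;
}
```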
Avoid Branch Mispredictions: Go Branchless
Branch Mispredictions cost around 15-20 cycles.
TODO perf counters
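The branchless idea in a nutshell: compute both outcomes and select with arithmetic, so a mispredictable data-dependent branch becomes a plain data dependency (compilers typically emit a conditional move). An illustrative C++ sketch with names of my choosing:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Select the minimum without a branch: mask is all-ones when a < b,
// all-zeros otherwise, so exactly one operand survives the masking.
inline uint32_t branchless_min(uint32_t a, uint32_t b) {
    uint32_t mask = -static_cast<uint32_t>(a < b);
    return (a & mask) | (b & ~mask);
}

// The same idea in a two-way merge step: advance whichever cursor holds
// the smaller element, without branching on the comparison result.
inline void merge_step(const uint32_t* l, const uint32_t* r,
                       size_t& li, size_t& ri, uint32_t* out, size_t& oi) {
    bool take_left = l[li] <= r[ri];
    out[oi++] = take_left ? l[li] : r[ri];  // compiles to cmov, not a jump
    li += take_left;                        // exactly one cursor advances
    ri += !take_left;
}
```

The trade-off: a conditional move always pays its latency, while a branch is free when predicted correctly, so going branchless only wins when the comparison outcome is effectively random, as it is when merging runs.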
Reduce Instructions: Unrolling and Co.
Make every instruction count!
TODO perf counters
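As a generic illustration of unrolling (not TigerBeetle's code): processing four elements per iteration amortizes the loop-control instructions, and using independent accumulators breaks the serial dependency chain so a super-scalar CPU can overlap the additions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 4x manually unrolled sum with four independent accumulators.
uint64_t sum_unrolled(const uint32_t* a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {    // main unrolled loop
        s0 += a[i + 0];
        s1 += a[i + 1];             // separate accumulators let the CPU
        s2 += a[i + 2];             // execute these adds in parallel
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];  // scalar tail for the remainder
    return s0 + s1 + s2 + s3;
}
```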
Ablation Study and Final Result
TODO perf counters