40x Faster Binary Search by Ragnar Groot Koerkamp

ScyllaDB · 28 slides · Oct 17, 2025

About This Presentation

This talk will first expose the lie that binary search takes O(lg n) time -- it very much does not! Instead, we will see that binary search has only constant overhead compared to an oracle. Then, we will exploit everything that modern CPUs have to offer (SIMD, ILP, prefetching, efficient caching) in...


Slide Content

P99 CONF
a ScyllaDB community event

40x faster binary search

Ragnar Groot Koerkamp
PhD @ ETH Zürich · curiouscoding.nl/slides/p99

RAGNAR {GROOT KOERKAMP}

+ Did IMO & ICPC; currently head of jury for NWERC.
+ Some time at Google.
+ Quit and solved all of projecteuler.net (700+) during Covid.
+ Just finished PhD on high-throughput bioinformatics @ ETH Zurich.
  = Lots of sequenced DNA that needs processing.
  = Many static datasets, e.g. a 3 GB human genome.
  = Revisiting basic algorithms and optimizing them to the limit.
  = Good in theory ≠ fast in practice.

PROBLEM STATEMENT

+ Input: a static sorted list of 32-bit integers.
+ Queries: given q, find the smallest value in the list >= q.

trait SearchIndex {
    // Initialize the data structure.
    fn new(sorted_data: &[u32]) -> Self;
    // Return the smallest value >= q.
    fn query(&self, q: u32) -> u32;
}

+ Why? E.g. searching through a suffix array.

+ Previous work on Algorithmica:
en.algorithmica.org/hpc/data-structures/s-tree

+ This work: curiouscoding.nl/posts/static-search-tree

BINARY SEARCH: COMPLEXITY O(lg n)

1. Compare q = 5 with the middle element x.
2. Recurse on the left half if q <= x, on the right half if q > x.

3. End when 1 element is left, after ⌈lg₂(n + 1)⌉ steps.

[Diagram: sorted array 1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 13, 14, 15; q = 5 is compared against the middle element.]
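
As a baseline, the steps above translate directly into an implementation of the SearchIndex trait (a minimal branchy sketch of my own, not the talk's exact code):

struct BinarySearch {
    data: Vec<u32>,
}

impl SearchIndex for BinarySearch {
    fn new(sorted_data: &[u32]) -> Self {
        BinarySearch { data: sorted_data.to_vec() }
    }

    // Return the smallest value >= q, or u32::MAX if there is none.
    fn query(&self, q: u32) -> u32 {
        // Invariant: the answer index is in lo..=hi (hi == len means none).
        let (mut lo, mut hi) = (0, self.data.len());
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            if self.data[mid] >= q {
                hi = mid;     // recurse on the left half (mid may be the answer)
            } else {
                lo = mid + 1; // recurse on the right half
            }
        }
        self.data.get(lo).copied().unwrap_or(u32::MAX)
    }
}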

BINARY SEARCH: LATENCY IS MORE THAN O(lg n)

[Plot: latency (ns) of binary search vs a ~lg(n) reference, for array sizes from 16K to 4G bytes, with L1, L2, L3, and RAM regions marked. Latency grows much faster than lg(n) once the array outgrows each cache level.]

ARRAY INDEXING: O(n^0.35) LATENCY!

[Plot: latency of binary search vs random array indexing, with ~lg(n) and ~n^0.35 reference curves; array sizes 16K to 4G bytes, L3 and RAM regions marked. Random indexing latency follows ~n^0.35.]

HEAP LAYOUT: EFFICIENT CACHING + PREFETCHING

+ Binary search: the top of the tree is spread out; each cache line holds only 1 useful value.
+ Eytzinger layout: the top layers of the tree are clustered into cache lines.
  = Also allows prefetching!

[Diagram: the same tree in the classic sorted layout vs the Eytzinger/heap layout.]
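
A minimal Eytzinger sketch (my reconstruction following the Algorithmica write-up, not the talk's exact code): the sorted array is stored in BFS order of an implicit binary tree, and the search takes one branchless step per level. layout[0] doubles as a "not found" sentinel.

/// Recursively place `sorted` into `layout[1..]` in BFS (heap) order.
/// `k` is the current heap index; `i` counts consumed sorted elements.
fn eytzinger(sorted: &[u32], layout: &mut [u32], k: usize, i: &mut usize) {
    if k < layout.len() {
        eytzinger(sorted, layout, 2 * k, i);     // left subtree first
        layout[k] = sorted[*i];
        *i += 1;
        eytzinger(sorted, layout, 2 * k + 1, i); // then right subtree
    }
}

/// Branchless lower bound: the smallest value >= q.
fn eytzinger_search(layout: &[u32], q: u32) -> u32 {
    let mut k = 1;
    while k < layout.len() {
        // Go left (2k) if layout[k] >= q, right (2k + 1) otherwise.
        k = 2 * k + (layout[k] < q) as usize;
    }
    // Cancel the trailing right-turns to recover the last node where we
    // went left (the answer); bit trick from the Algorithmica article.
    k >>= k.trailing_ones() + 1;
    layout[k] // k == 0 means every element is < q
}

fn main() {
    let sorted = [1, 2, 3, 5, 8, 13, 21];
    let mut layout = vec![u32::MAX; sorted.len() + 1]; // layout[0] = sentinel
    eytzinger(&sorted, &mut layout, 1, &mut 0);
    assert_eq!(eytzinger_search(&layout, 6), 8);
    assert_eq!(eytzinger_search(&layout, 22), u32::MAX);
}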

HEAP LAYOUT: CLOSE TO ARRAY INDEXING!

[Plot: latency (ns, 0-300) of binary search vs array indexing vs the Eytzinger layout, with a ~lg(n) reference; array sizes 16K to 4G bytes, L3 and RAM regions marked. The Eytzinger curve stays close to the array-indexing curve.]

STATIC SEARCH TREES / B-TREES

+ Fully use each cache line.

[Diagram: a three-layer static search tree over the values 0-17: root block (6, 12), second layer (2, 4 | 8, 10 | 14, 16), leaf layer 0-17, with the offset of each layer's first block.]

#[repr(align(64))] // Each block fills exactly one 512-bit cache line.
struct Block { data: [u32; 16] }

struct Tree {
    tree: Vec<Block>,     // Blocks.
    offsets: Vec<usize>,  // Index of first block in each layer.
}
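
For orientation, here is a hedged sketch of the single-query walk over this structure (my reconstruction, shaped to match the `batch` diff on a later slide; it assumes `Block::find`, defined on the next slides, returns how many of a block's N = 16 keys are < q):

const N: usize = 16; // keys per block

impl Tree {
    fn node(&self, b: usize) -> &Block {
        &self.tree[b]
    }

    fn query(&self, q: u32) -> u32 {
        let mut k = 0; // index of the current node within its layer
        // Walk down all layers except the last (leaf) one.
        for &o in &self.offsets[..self.offsets.len() - 1] {
            let jump_to = self.node(o + k).find(q); // child to descend into
            k = k * (N + 1) + jump_to;              // N keys => N + 1 children
        }
        let o = *self.offsets.last().unwrap();
        let idx = self.node(o + k).find(q);
        // `find` can return N (all keys < q); idx / N then spills over
        // into the next, adjacent block.
        self.tree[o + k + idx / N].data[idx % N]
    }
}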

STATIC SEARCH TREES: SLOWER THAN EYTZINGER?!

UP NEXT: ASSEMBLY-LEVEL OPTIMIZATIONS :)

OPTIMIZING find: LINEAR SCAN BASELINE

fn find_linear(block: &Block, q: u32) -> usize {
    for i in 0..N {
        if block.data[i] >= q {
            // Early break causes branch mispredictions
            // and prevents auto-vectorization!
            return i;
        }
    }
    N
}

OPTIMIZING find: AUTO-VECTORIZATION

fn find_count(block: &Block, q: u32) -> usize {
    let mut count = 0;
    for i in 0..N {
        if block.data[i] < q {
            count += 1;
        }
    }
    count
}

vmovdqu      (%rax,%rcx), %ymm1       ; load data[..8]
vmovdqu      32(%rax,%rcx), %ymm2     ; load data[8..]
vpbroadcastd %xmm0, %ymm0             ; 'splat' the query value
vpmaxud      %ymm0, %ymm2, %ymm3      ; v
vpcmpeqd     %ymm3, %ymm2, %ymm2      ; v
vpmaxud      %ymm0, %ymm1, %ymm0      ; v
vpcmpeqd     %ymm0, %ymm1, %ymm0      ; 4x compare query with values
vpackssdw    %ymm2, %ymm0, %ymm0      ;
vpcmpeqd     %ymm1, %ymm1, %ymm1      ; v
vpxor        %ymm1, %ymm0, %ymm0      ; 2x negate result
vextracti128 $1, %ymm0, %xmm1         ; v
vpacksswb    %xmm1, %xmm0, %xmm0      ; v
vpshufd      $216, %xmm0, %xmm0       ; v
vpmovmskb    %xmm0, %ecx              ; 4x extract mask
popcntl      %ecx, %ecx

OPTIMIZING find: POPCOUNT

#![feature(portable_simd)]
use std::simd::prelude::*;

fn find_popcount(block: &Block, q: u32) -> usize {
    let data: Simd<u32, N> = Simd::from_slice(&block.data[0..N]);
    let q = Simd::splat(q);
    let mask = data.simd_lt(q);             // x[i] < q
    mask.to_bitmask().count_ones() as usize // count_ones
}

vpcmpgtd     (%rsi,%rdi), %ymm0, %ymm1    ;
vpcmpgtd     32(%rsi,%rdi), %ymm0, %ymm0  ;
vpackssdw    %ymm1, %ymm0, %ymm0          ; interleave 16-bit low halves
vextracti128 $1, %ymm0, %xmm1             ; 1
vpacksswb    %xmm1, %xmm0, %xmm0          ; 2
vpshufd      $216, %xmm0, %xmm0           ; 3 unshuffle interleaving
vpmovmskb    %xmm0, %edi                  ; 4 instructions to extract bitmask
popcntl      %edi, %edi

OPTIMIZING find: MANUAL movemask_epi8

use std::arch::x86_64::{_mm256_movemask_epi8, _mm256_packs_epi32};

fn find_manual(block: &Block, q: i32) -> usize { // i32 now!
    let low: Simd<i32, 8> = Simd::from_slice(&block.data[0..N / 2]);
    let high: Simd<i32, 8> = Simd::from_slice(&block.data[N / 2..N]);
    let q_simd = Simd::<i32, 8>::splat(q);

    let mask_low = q_simd.simd_gt(low);   // compare with low half
    let mask_high = q_simd.simd_gt(high); // compare with high half
    let merged = unsafe {
        _mm256_packs_epi32(mask_low.to_int().into(), mask_high.to_int().into()) // interleave
    };
    let mask: i32 = unsafe { _mm256_movemask_epi8(merged) }; // movemask_epi8!
    mask.count_ones() as usize / 2 // correct for double-counting
}

 vpcmpgtd     (%rsi,%rdi), %ymm0, %ymm1    ; 1 cycle, in parallel
 vpcmpgtd     32(%rsi,%rdi), %ymm0, %ymm0  ; 1 cycle, in parallel
 vpackssdw    %ymm0, %ymm1, %ymm0          ; 1 cycle
-vextracti128 $1, %ymm0, %xmm1             ;
-vpacksswb    %xmm1, %xmm0, %xmm0          ;
-vpshufd      $216, %xmm0, %xmm0           ;
-vpmovmskb    %xmm0, %edi                  ; 4 instructions emulating movemask_epi16
+vpmovmskb    %ymm0, %edi                  ; 2 cycles, movemask_epi8
 popcntl      %edi, %edi                   ; 1 cycle
                                           ; /2 is folded into pointer arithmetic

THE RESULTS: BRANCHLESS IS GREAT!

THROUGHPUT, NOT LATENCY

BATCHING: MANY QUERIES IN PARALLEL

-fn query(&self, q: u32) -> u32 {
+fn batch<const B: usize>(&self, qs: &[u32; B]) -> [u32; B] {
-    let mut k = 0;                                      // current index
+    let mut k = [0; B];                                 // current indices
     for [o, _o2] in self.offsets.array_windows() {      // walk down the tree
+        for i in 0..B {
             let jump_to = self.node(o + k[i]).find(qs[i]); // call `find`
             k[i] = k[i] * (N + 1) + jump_to;            // update index
+        }
     }
     let o = self.offsets.last().unwrap();
     from_fn(|i| {
         let idx = self.node(o + k[i]).find(qs[i]);
         self.tree[o + k[i] + idx / N].data[idx % N]     // return value
     })
 }

BATCHING: UP TO 2.5X FASTER!

PREFETCHING

 fn batch<const B: usize>(&self, qs: &[u32; B]) -> [u32; B] {
     let mut k = [0; B];                                 // current indices
     for [o, o2] in self.offsets.array_windows() {       // walk down the tree
         for i in 0..B {
             let jump_to = self.node(o + k[i]).find(qs[i]); // call `find`
             k[i] = k[i] * (N + 1) + jump_to;            // update index
+            prefetch_index(&self.tree, o2 + k[i]);      // prefetch next layer
         }
     }
     let o = self.offsets.last().unwrap();
     from_fn(|i| {
         let idx = self.node(o + k[i]).find(qs[i]);
         self.tree[o + k[i] + idx / N].data[idx % N]     // return value
     })
 }
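
The `prefetch_index` helper is not shown on the slides; a plausible implementation (my sketch, using the x86-64 prefetch intrinsic from std::arch) is:

use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Hint the CPU to pull the cache line holding `slice[index]` into L1.
/// Sketch only: a prefetch of a stale or out-of-range address does not
/// fault, it is merely wasted work.
fn prefetch_index<T>(slice: &[T], index: usize) {
    let ptr = slice.as_ptr().wrapping_add(index) as *const i8;
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr) };
}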

PREFETCHING: 30% FASTER AGAIN!

OPTIMIZING POINTER ARITHMETIC: MORE GAINS

+ Convert all pointers to byte units, to avoid conversions in the hot loop (see the sketch below).
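
For instance (a hypothetical sketch, not the talk's code), a node lookup where the caller keeps the running offset in byte units needs only a single add, with no shift or multiply by size_of::<Block>():

/// Resolve a node from the tree base pointer and an offset in bytes.
/// Hypothetical helper: `byte_off` is maintained in byte units
/// throughout the walk, so no index-to-byte scaling happens here.
unsafe fn node_at(tree: *const Block, byte_off: usize) -> *const Block {
    tree.cast::<u8>().add(byte_off).cast::<Block>()
}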

INTERLEAVING: MORE PRESSURE ON THE RAM, -20%

+ Interleave multiple batches, each at a different iteration (tree layer), to keep more memory requests in flight.

TREE LAYOUT: INTERNAL NODES STORE MINIMA

+ For query 5.5, we walk down the left subtree.
+ Returning 6 reads a new cache line.

[Diagram: tree with root block (6, 12) and second layer (2, 4 | 8, 10 | 14, 16) over leaves 0-17; the answer 6 lies in a different cache line from the leaves just searched.]

TREE LAYOUT: INTERNAL NODES STORE MAXIMA

+ For query 5.5, we walk down the middle subtree.
+ Returning 6 reads the same cache line.

[Diagram: tree with root block (5, 11) and second layer (1, 3 | 7, 9 | 13, 15) over leaves 0-17; the answer 6 sits in the cache line already being read.]
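
A sketch of how such a maxima layer could be built (my own construction under the stated assumptions, not the talk's code): internal key j is the maximum of child j's subtree, which for a sorted bottom-up build is simply the child's last value.

/// Build one parent layer above `children`: each parent node stores,
/// for each of its first N children, that child's maximum value.
/// Assumes child blocks are fully filled; real code must pad partial
/// groups (e.g. with u32::MAX sentinels).
fn build_parent_layer(children: &[Block]) -> Vec<Block> {
    children
        .chunks(N + 1) // N keys separate N + 1 children
        .map(|group| {
            let mut data = [u32::MAX; N];
            for (j, child) in group.iter().take(N).enumerate() {
                data[j] = child.data[N - 1]; // max of child j's subtree
            }
            Block { data }
        })
        .collect()
}

With maxima, `find(q)` (the count of keys < q) directly yields the first child whose maximum is >= q, so the answer for an in-range query stays inside the subtree, and cache line, being read.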

TREE LAYOUT: ANOTHER 10% GAINED!

CONCLUSION

+ With a 4 GB input: 40x speedup!
  = binary search: 1150 ns
  = static search tree: 27 ns
+ Latency of the Eytzinger-layout search is about 3x the RAM latency.
+ Throughput of the static search tree is within 3x of the raw RAM throughput.
+ RAM is slow, caches are fast.
+ Grinding through assembly works; be your own compiler!
+ Theoretical big-O complexities are useless here, since caches invalidate the RAM-model assumptions.

P99 CONF

+ X: @curious_coding
+ Bsky: @curiouscoding.nl
+ Blog: curiouscoding.nl
  = /posts/static-search-tree
  = /slides/p99