40x Faster Binary Search by Ragnar Groot Koerkamp

ScyllaDB · 28 slides · Oct 17, 2025

About This Presentation

This talk will first expose the lie that binary search takes O(lg n) time -- it very much does not! Instead, we will see that binary search has only constant overhead compared to an oracle. Then, we will exploit everything that modern CPUs have to offer (SIMD, ILP, prefetching, efficient caching) in...


Slide Content

P99 CONF
a ScyllaDB community event

40x faster binary search

Ragnar Groot Koerkamp
PhD @ ETH Zürich · curiouscoding.nl/slides/p99

RAGNAR {GROOT KOERKAMP}

+ Did IMO & ICPC; currently head of jury for NWERC.
+ Some time at Google.
+ Quit and solved all of projecteuler.net (700+) during Covid.
+ Just finished PhD on high-throughput bioinformatics @ ETH Zurich.
  = Lots of sequenced DNA that needs processing.
  = Many static datasets, e.g. a 3 GB human genome.
  = Revisiting basic algorithms and optimizing them to the limit.
  = Good in theory ≠ fast in practice.

PROBLEM STATEMENT

+ Input: a static sorted list of 32-bit integers.
+ Queries: given q, find the smallest value in the list >= q.

trait SearchIndex {
    // Initialize the data structure.
    fn new(sorted_data: &[u32]) -> Self;
    // Return the smallest value >= q.
    fn query(&self, q: u32) -> u32;
}

+ Why? E.g. searching through a suffix array.

+ Previous work on Algorithmica:
en.algorithmica.org/hpc/data-structures/s-tree

+ This work: curiouscoding.nl/posts/static-search-tree

BINARY SEARCH: COMPLEXITY O(lg n)

1. Compare q = 5 with the middle element x.
2. Recurse on the left half if q <= x, on the right half if q > x.

3. End when 1 element is left, after ⌈lg₂(n + 1)⌉ steps.

[Diagram: sorted array 1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 13, 14, 15; q = 5 is compared against the middle element.]
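
As a baseline, the steps above translate directly into an implementation of the SearchIndex trait (a minimal branchy sketch of my own, not the talk's exact code):

struct BinarySearch {
    data: Vec<u32>,
}

impl SearchIndex for BinarySearch {
    fn new(sorted_data: &[u32]) -> Self {
        BinarySearch { data: sorted_data.to_vec() }
    }

    // Return the smallest value >= q, or u32::MAX if there is none.
    fn query(&self, q: u32) -> u32 {
        // Invariant: the answer index is in lo..=hi (hi == len means none).
        let (mut lo, mut hi) = (0, self.data.len());
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            if self.data[mid] >= q {
                hi = mid;     // recurse on the left half (mid may be the answer)
            } else {
                lo = mid + 1; // recurse on the right half
            }
        }
        self.data.get(lo).copied().unwrap_or(u32::MAX)
    }
}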

BINARY SEARCH: LATENCY IS MORE THAN O(lg n)

[Plot: latency (ns) of binary search vs a ~lg(n) reference, for array sizes from 16K to 4G bytes, with L1, L2, L3, and RAM regions marked. Latency grows much faster than lg(n) once the array outgrows each cache level.]

ARRAY INDEXING: O(n^0.35) LATENCY!

[Plot: latency of binary search vs random array indexing, with ~lg(n) and ~n^0.35 reference curves; array sizes 16K to 4G bytes, L3 and RAM regions marked. Random indexing latency follows ~n^0.35.]

HEAP LAYOUT: EFFICIENT CACHING + PREFETCHING

+ Binary search: the top of the tree is spread out; each cache line holds only 1 useful value.
+ Eytzinger layout: the top layers of the tree are clustered into cache lines.
  = Also allows prefetching!

[Diagram: the same tree in the classic sorted layout vs the Eytzinger/heap layout.]
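
A minimal Eytzinger sketch (my reconstruction following the Algorithmica write-up, not the talk's exact code): the sorted array is stored in BFS order of an implicit binary tree, and the search takes one branchless step per level. layout[0] doubles as a "not found" sentinel.

/// Recursively place `sorted` into `layout[1..]` in BFS (heap) order.
/// `k` is the current heap index; `i` counts consumed sorted elements.
fn eytzinger(sorted: &[u32], layout: &mut [u32], k: usize, i: &mut usize) {
    if k < layout.len() {
        eytzinger(sorted, layout, 2 * k, i);     // left subtree first
        layout[k] = sorted[*i];
        *i += 1;
        eytzinger(sorted, layout, 2 * k + 1, i); // then right subtree
    }
}

/// Branchless lower bound: the smallest value >= q.
fn eytzinger_search(layout: &[u32], q: u32) -> u32 {
    let mut k = 1;
    while k < layout.len() {
        // Go left (2k) if layout[k] >= q, right (2k + 1) otherwise.
        k = 2 * k + (layout[k] < q) as usize;
    }
    // Cancel the trailing right-turns to recover the last node where we
    // went left (the answer); bit trick from the Algorithmica article.
    k >>= k.trailing_ones() + 1;
    layout[k] // k == 0 means every element is < q
}

fn main() {
    let sorted = [1, 2, 3, 5, 8, 13, 21];
    let mut layout = vec![u32::MAX; sorted.len() + 1]; // layout[0] = sentinel
    eytzinger(&sorted, &mut layout, 1, &mut 0);
    assert_eq!(eytzinger_search(&layout, 6), 8);
    assert_eq!(eytzinger_search(&layout, 22), u32::MAX);
}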

HEAP LAYOUT: CLOSE TO ARRAY INDEXING!

[Plot: latency (ns, 0-300) of binary search vs array indexing vs the Eytzinger layout, with a ~lg(n) reference; array sizes 16K to 4G bytes, L3 and RAM regions marked. The Eytzinger curve stays close to the array-indexing curve.]

STATIC SEARCH TREES / B-TREES

+ Fully use each cache line.

[Diagram: a three-layer static search tree over the values 0-17: root block (6, 12), second layer (2, 4 | 8, 10 | 14, 16), leaf layer 0-17, with the offset of each layer's first block.]

#[repr(align(64))] // Each block fills exactly one 512-bit cache line.
struct Block { data: [u32; 16] }

struct Tree {
    tree: Vec<Block>,     // Blocks.
    offsets: Vec<usize>,  // Index of first block in each layer.
}
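
For orientation, here is a hedged sketch of the single-query walk over this structure (my reconstruction, shaped to match the `batch` diff on a later slide; it assumes `Block::find`, defined on the next slides, returns how many of a block's N = 16 keys are < q):

const N: usize = 16; // keys per block

impl Tree {
    fn node(&self, b: usize) -> &Block {
        &self.tree[b]
    }

    fn query(&self, q: u32) -> u32 {
        let mut k = 0; // index of the current node within its layer
        // Walk down all layers except the last (leaf) one.
        for &o in &self.offsets[..self.offsets.len() - 1] {
            let jump_to = self.node(o + k).find(q); // child to descend into
            k = k * (N + 1) + jump_to;              // N keys => N + 1 children
        }
        let o = *self.offsets.last().unwrap();
        let idx = self.node(o + k).find(q);
        // `find` can return N (all keys < q); idx / N then spills over
        // into the next, adjacent block.
        self.tree[o + k + idx / N].data[idx % N]
    }
}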

STATIC SEARCH TREES: SLOWER THAN EYTZINGER?!

UP NEXT: ASSEMBLY-LEVEL OPTIMIZATIONS :)

OPTIMIZING find: LINEAR SCAN BASELINE

fn find_linear(block: &Block, q: u32) -> usize {
    for i in 0..N {
        if block.data[i] >= q {
            // Early break causes branch mispredictions
            // and prevents auto-vectorization!
            return i;
        }
    }
    N
}

OPTIMIZING find: AUTO-VECTORIZATION

fn find_count(block: &Block, q: u32) -> usize {
    let mut count = 0;
    for i in 0..N {
        if block.data[i] < q {
            count += 1;
        }
    }
    count
}

vmovdqu      (%rax,%rcx), %ymm1       ; load data[..8]
vmovdqu      32(%rax,%rcx), %ymm2     ; load data[8..]
vpbroadcastd %xmm0, %ymm0             ; 'splat' the query value
vpmaxud      %ymm0, %ymm2, %ymm3      ; v
vpcmpeqd     %ymm3, %ymm2, %ymm2      ; v
vpmaxud      %ymm0, %ymm1, %ymm0      ; v
vpcmpeqd     %ymm0, %ymm1, %ymm0      ; 4x compare query with values
vpackssdw    %ymm2, %ymm0, %ymm0      ;
vpcmpeqd     %ymm1, %ymm1, %ymm1      ; v
vpxor        %ymm1, %ymm0, %ymm0      ; 2x negate result
vextracti128 $1, %ymm0, %xmm1         ; v
vpacksswb    %xmm1, %xmm0, %xmm0      ; v
vpshufd      $216, %xmm0, %xmm0       ; v
vpmovmskb    %xmm0, %ecx              ; 4x extract mask
popcntl      %ecx, %ecx

OPTIMIZING find: POPCOUNT

#![feature(portable_simd)]
use std::simd::prelude::*;

fn find_popcount(block: &Block, q: u32) -> usize {
    let data: Simd<u32, N> = Simd::from_slice(&block.data[0..N]);
    let q = Simd::splat(q);
    let mask = data.simd_lt(q);             // x[i] < q
    mask.to_bitmask().count_ones() as usize // count_ones
}

vpcmpgtd     (%rsi,%rdi), %ymm0, %ymm1    ;
vpcmpgtd     32(%rsi,%rdi), %ymm0, %ymm0  ;
vpackssdw    %ymm1, %ymm0, %ymm0          ; interleave 16-bit low halves
vextracti128 $1, %ymm0, %xmm1             ; 1
vpacksswb    %xmm1, %xmm0, %xmm0          ; 2
vpshufd      $216, %xmm0, %xmm0           ; 3 unshuffle interleaving
vpmovmskb    %xmm0, %edi                  ; 4 instructions to extract bitmask
popcntl      %edi, %edi

OPTIMIZING find: MANUAL movemask_epi8

use std::arch::x86_64::{_mm256_movemask_epi8, _mm256_packs_epi32};

fn find_manual(block: &Block, q: i32) -> usize { // i32 now!
    let low: Simd<i32, 8> = Simd::from_slice(&block.data[0..N / 2]);
    let high: Simd<i32, 8> = Simd::from_slice(&block.data[N / 2..N]);
    let q_simd = Simd::<i32, 8>::splat(q);

    let mask_low = q_simd.simd_gt(low);   // compare with low half
    let mask_high = q_simd.simd_gt(high); // compare with high half
    let merged = unsafe {
        _mm256_packs_epi32(mask_low.to_int().into(), mask_high.to_int().into()) // interleave
    };
    let mask: i32 = unsafe { _mm256_movemask_epi8(merged) }; // movemask_epi8!
    mask.count_ones() as usize / 2 // correct for double-counting
}

 vpcmpgtd     (%rsi,%rdi), %ymm0, %ymm1    ; 1 cycle, in parallel
 vpcmpgtd     32(%rsi,%rdi), %ymm0, %ymm0  ; 1 cycle, in parallel
 vpackssdw    %ymm0, %ymm1, %ymm0          ; 1 cycle
-vextracti128 $1, %ymm0, %xmm1             ;
-vpacksswb    %xmm1, %xmm0, %xmm0          ;
-vpshufd      $216, %xmm0, %xmm0           ;
-vpmovmskb    %xmm0, %edi                  ; 4 instructions emulating movemask_epi16
+vpmovmskb    %ymm0, %edi                  ; 2 cycles, movemask_epi8
 popcntl      %edi, %edi                   ; 1 cycle
                                           ; /2 is folded into pointer arithmetic

THE RESULTS: BRANCHLESS IS GREAT!

THROUGHPUT, NOT LATENCY

BATCHING: MANY QUERIES IN PARALLEL

-fn query(&self, q: u32) -> u32 {
+fn batch<const B: usize>(&self, qs: &[u32; B]) -> [u32; B] {
-    let mut k = 0;                                      // current index
+    let mut k = [0; B];                                 // current indices
     for [o, _o2] in self.offsets.array_windows() {      // walk down the tree
+        for i in 0..B {
             let jump_to = self.node(o + k[i]).find(qs[i]); // call `find`
             k[i] = k[i] * (N + 1) + jump_to;            // update index
+        }
     }
     let o = self.offsets.last().unwrap();
     from_fn(|i| {
         let idx = self.node(o + k[i]).find(qs[i]);
         self.tree[o + k[i] + idx / N].data[idx % N]     // return value
     })
 }

BATCHING: UP TO 2.5X FASTER!

PREFETCHING

 fn batch<const B: usize>(&self, qs: &[u32; B]) -> [u32; B] {
     let mut k = [0; B];                                 // current indices
     for [o, o2] in self.offsets.array_windows() {       // walk down the tree
         for i in 0..B {
             let jump_to = self.node(o + k[i]).find(qs[i]); // call `find`
             k[i] = k[i] * (N + 1) + jump_to;            // update index
+            prefetch_index(&self.tree, o2 + k[i]);      // prefetch next layer
         }
     }
     let o = self.offsets.last().unwrap();
     from_fn(|i| {
         let idx = self.node(o + k[i]).find(qs[i]);
         self.tree[o + k[i] + idx / N].data[idx % N]     // return value
     })
 }
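
The `prefetch_index` helper is not shown on the slides; a plausible implementation (my sketch, using the x86-64 prefetch intrinsic from std::arch) is:

use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Hint the CPU to pull the cache line holding `slice[index]` into L1.
/// Sketch only: a prefetch of a stale or out-of-range address does not
/// fault, it is merely wasted work.
fn prefetch_index<T>(slice: &[T], index: usize) {
    let ptr = slice.as_ptr().wrapping_add(index) as *const i8;
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr) };
}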

PREFETCHING: 30% FASTER AGAIN!

OPTIMIZING POINTER ARITHMETIC: MORE GAINS

+ Convert all pointers to byte units, to avoid conversions in the hot loop (see the sketch below).
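
For instance (a hypothetical sketch, not the talk's code), a node lookup where the caller keeps the running offset in byte units needs only a single add, with no shift or multiply by size_of::<Block>():

/// Resolve a node from the tree base pointer and an offset in bytes.
/// Hypothetical helper: `byte_off` is maintained in byte units
/// throughout the walk, so no index-to-byte scaling happens here.
unsafe fn node_at(tree: *const Block, byte_off: usize) -> *const Block {
    tree.cast::<u8>().add(byte_off).cast::<Block>()
}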

INTERLEAVING: MORE PRESSURE ON THE RAM, -20%

+ Interleave multiple batches, each at a different iteration (tree layer), to keep more memory requests in flight.

TREE LAYOUT: INTERNAL NODES STORE MINIMA

+ For query 5.5, we walk down the left subtree.
+ Returning 6 reads a new cache line.

[Diagram: tree with root block (6, 12) and second layer (2, 4 | 8, 10 | 14, 16) over leaves 0-17; the answer 6 lies in a different cache line from the leaves just searched.]

TREE LAYOUT: INTERNAL NODES STORE MAXIMA

+ For query 5.5, we walk down the middle subtree.
+ Returning 6 reads the same cache line.

[Diagram: tree with root block (5, 11) and second layer (1, 3 | 7, 9 | 13, 15) over leaves 0-17; the answer 6 sits in the cache line already being read.]
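
A sketch of how such a maxima layer could be built (my own construction under the stated assumptions, not the talk's code): internal key j is the maximum of child j's subtree, which for a sorted bottom-up build is simply the child's last value.

/// Build one parent layer above `children`: each parent node stores,
/// for each of its first N children, that child's maximum value.
/// Assumes child blocks are fully filled; real code must pad partial
/// groups (e.g. with u32::MAX sentinels).
fn build_parent_layer(children: &[Block]) -> Vec<Block> {
    children
        .chunks(N + 1) // N keys separate N + 1 children
        .map(|group| {
            let mut data = [u32::MAX; N];
            for (j, child) in group.iter().take(N).enumerate() {
                data[j] = child.data[N - 1]; // max of child j's subtree
            }
            Block { data }
        })
        .collect()
}

With maxima, `find(q)` (the count of keys < q) directly yields the first child whose maximum is >= q, so the answer for an in-range query stays inside the subtree, and cache line, being read.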

TREE LAYOUT: ANOTHER 10% GAINED!

CONCLUSION

+ With a 4 GB input: 40x speedup!
  = binary search: 1150 ns
  = static search tree: 27 ns
+ Latency of the Eytzinger-layout search is about 3x the RAM latency.
+ Throughput of the static search tree is within 3x of the raw RAM throughput.
+ RAM is slow, caches are fast.
+ Grinding through assembly works; be your own compiler!
+ Theoretical big-O complexities are useless here, since caches invalidate the RAM-model assumptions.

P99 CONF

+ X: @curious_coding
+ Bsky: @curiouscoding.nl
+ Blog: curiouscoding.nl
  = /posts/static-search-tree
  = /slides/p99