This talk will first expose the lie that binary search takes O(lg n) time -- it very much does not! Instead, we will see that binary search has only constant overhead compared to an oracle. Then, we will exploit everything that modern CPUs have to offer (SIMD, ILP, prefetching, efficient caching) in order to gain 40x increased throughput over the Rust standard library implementation.
Ragnar Groot Koerkamp
PhD, ETH Zürich · curiouscoding.nl/slides/p99
RAGNAR {GROOT KOERKAMP}
+ Did IMO & ICPC; currently head-of-jury for NWERC.
+ Some time at Google.
+ Quit and solved all of projecteuler.net (700+) during Covid.
+ Just finished PhD on high-throughput bioinformatics @ ETH Zurich.
  = Lots of sequenced DNA that needs processing.
  = Many static datasets, e.g. a 3GB human genome.
  = Revisiting basic algorithms and optimizing them to the limit.
  = Good in theory ≠ fast in practice.
PROBLEM STATEMENT
+ Input: a static sorted list of 32-bit integers.
+ Queries: given q, find the smallest value in the list >= q.
trait SearchIndex {
    // Initialize the data structure.
    fn new(sorted_data: &[u32]) -> Self;
    // Return the smallest value >= q.
    fn query(&self, q: u32) -> u32;
}
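As a reference point, here is a minimal sketch of a baseline implementation of this trait on top of the standard library's binary search (the `BinarySearch` name and the `u32::MAX` sentinel for "no such value" are choices made here, not taken from the slides):

struct BinarySearch {
    data: Vec<u32>,
}

impl SearchIndex for BinarySearch {
    fn new(sorted_data: &[u32]) -> Self {
        Self { data: sorted_data.to_vec() }
    }
    fn query(&self, q: u32) -> u32 {
        // partition_point returns the index of the first element >= q.
        let i = self.data.partition_point(|&x| x < q);
        // Sentinel when every element is < q.
        self.data.get(i).copied().unwrap_or(u32::MAX)
    }
}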
+ Why? E.g. searching through a suffix array.
+ Previous work on Algorithmica:
en.algorithmica.org/hpc/data-structures/s-tree
+ This work: curiouscoding.nl/posts/static-search-tree
BINARY SEARCH: COMPLEXITY O(lg n)
1. Compare q = 5 with the middle element x.
2. Recurse on left half if q < x, right half if q > x.
3. End when 1 element is left, after ⌈lg₂(n + 1)⌉ steps (see the sketch below).
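Written out, the three steps above could look like the following sketch; the `u32::MAX` sentinel for "no value >= q" is an assumption matching the baseline above.

// Smallest value >= q in a sorted slice, via classic binary search.
fn lower_bound(data: &[u32], q: u32) -> u32 {
    let (mut lo, mut hi) = (0, data.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;   // 1. compare q with the middle element x
        if data[mid] < q {
            lo = mid + 1;               // 2. recurse on the right half (q > x) ...
        } else {
            hi = mid;                   //    ... or on the left half (q <= x)
        }
    }
    // 3. after ~lg2(n + 1) halvings only one candidate position is left
    data.get(lo).copied().unwrap_or(u32::MAX)
}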
+ Binary search: top of tree is spread out; each cache line has 1 value.
+ Eytzinger layout: top layers of tree are clustered in cache lines.
  = Also allows prefetching! (See the sketch below.)
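A minimal sketch of a branchless Eytzinger-layout search with explicit prefetching, assuming x86_64 and a 1-indexed layout where tree[0] is unused padding; the function name and details are illustrative, not the slides' exact code.

use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

// tree[1..] holds the sorted values in BFS (Eytzinger) order.
// Returns the Eytzinger index of the smallest value >= q, or 0 if none exists.
fn eytzinger_lower_bound(tree: &[u32], q: u32) -> usize {
    let mut k = 1;
    while k < tree.len() {
        // The 16 great-great-grandchildren of node k are contiguous, so we can
        // prefetch them four levels ahead of the comparisons that will need them.
        if 16 * k < tree.len() {
            unsafe { _mm_prefetch::<_MM_HINT_T0>(tree.as_ptr().add(16 * k) as *const i8) };
        }
        // Branchless descent: left child is 2k, right child is 2k + 1.
        k = 2 * k + (tree[k] < q) as usize;
    }
    // Cancel the trailing "go right" steps (plus the final overshoot)
    // to recover the index of the answer.
    k >> (k.trailing_ones() + 1)
}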
let mask_low  = q_simd.simd_gt(low );                  // compare with low half
let mask_high = q_simd.simd_gt(high);                  // compare with high half
let merged = _mm256_packs_epi32(mask_low, mask_high);  // interleave
let mask: i32 = _mm256_movemask_epi8(merged);          // movemask_epi8!
mask.count_ones() as usize / 2                         // correct for double-counting
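Assembled into one self-contained function, the snippet above might look like the following sketch: it counts how many of a node's 16 keys are strictly smaller than q, i.e. which child of the static search tree node to descend into. It assumes AVX2 is enabled at compile time (e.g. RUSTFLAGS="-C target-cpu=x86-64-v3") and nightly Rust for portable SIMD; everything besides the intrinsics shown on the slide is illustrative.

#![feature(portable_simd)] // portable SIMD is nightly-only
use std::arch::x86_64::{__m256i, _mm256_movemask_epi8, _mm256_packs_epi32};
use std::simd::prelude::*; // u32x8, SimdPartialOrd, ...

// Number of keys in one 16-key node that are strictly smaller than q,
// i.e. the index of the child to descend into.
// Caller must ensure AVX2 is available (compile with the flag above).
fn find(keys: &[u32; 16], q: u32) -> usize {
    let q_simd = u32x8::splat(q);
    let low = u32x8::from_slice(&keys[0..8]);
    let high = u32x8::from_slice(&keys[8..16]);
    let mask_low = q_simd.simd_gt(low);    // unsigned compare with low half
    let mask_high = q_simd.simd_gt(high);  // unsigned compare with high half
    unsafe {
        // Pack the two 8x32-bit masks into one 16x16-bit vector; lanes end up
        // interleaved per 128-bit half, which is fine since we only count bits.
        let merged = _mm256_packs_epi32(
            std::mem::transmute::<_, __m256i>(mask_low.to_int()),
            std::mem::transmute::<_, __m256i>(mask_high.to_int()),
        );
        // movemask gives one bit per byte: each true lane sets 2 bits,
        // so halve the popcount to correct for double-counting.
        let mask: i32 = _mm256_movemask_epi8(merged);
        mask.count_ones() as usize / 2
    }
}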
let mut k = [0; B];                                    // current indices
for [o, o2] in self.offsets.array_windows() {          // walk down the tree
    for i in 0..B {
        let jump_to = self.node(o + k[i]).find(qb[i]); // call `find`
        k[i] = k[i] * (B + 1) + jump_to;               // update index
        prefetch_index(&self.tree, o2 + k[i]);         // prefetch next layer
    }
}
let o = self.offsets.last().unwrap();
from_fn(|i| {
    let idx = self.node(o + k[i]).find(qb[i]);
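Processing a batch of B queries layer by layer like this trades latency for throughput: the B independent `find` calls and prefetches in the inner loop can overlap (ILP), and each layer's cache lines are requested one layer ahead of when they are needed, so the walk is limited by memory throughput rather than by one RAM latency per level per query.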
+ With 4GB input: 40x speedup!
= binary search: 1150ns
= static search tree: 27ns
+ Latency of Eytzinger-layout search is only ~3x the latency of a single RAM access.
+ Throughput of the static search tree is within ~3x of peak RAM throughput.
+ RAM is slow, caches are fast.
+ Grinding through assembly works; be your own compiler!
+ Theoretical big-O complexities are useless here, since caches invalidate
the RAM-model assumptions.