Parsing Protobuf as Fast as Possible by Miguel Young de la Sota
ScyllaDB
23 slides
Oct 14, 2025
About This Presentation
Protobuf is an extremely popular binary data interchange format. This session dives into hyperpb, a Protobuf parser for Go that uses every trick in the book, and some new ones, to achieve the highest throughput possible. We'll talk about table-driven parsing, the parser paradigm used by hyperpb, and hyperpb's innovations on the concept. We'll also take a look at the high-level architecture of hyperpb, including its compiler, parser VM, and layout optimizations.
Slide Content
A ScyllaDB Community
Parsing Protobuf as
Fast as Possible
Miguel Young de la Sota
Engineer
Miguel Young de la Sota
Engineer at Buf
■Designed Protobuf Editions and Protobuf Rust.
■Obsessed with perf at all costs in every language.
■I’ve ICEed every compiler I’ve ever touched…
■Also a terminally-online furry artist!
What’s a Protobuf?
●Somewhat dated (c. 1999) binary serialization format from Google, centered
around a tag-length-value record layer. Field numbers are 29-bit integers.
●Schema-aware ecosystem: types are defined in IDL files, which can be
compiled into efficient data structures + codecs in many languages.
●One Protobuf type (a message) compiles to one parser function (giant
switch-case on record tags).
●Fields can be accessed directly with accessors or generically with
Java/Go-like dynamic reflection.
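The tag-length-value layer described above can be sketched in a few lines of Go. This hand-rolled decoder (an illustration, not part of any Protobuf library) splits a varint record tag into its 29-bit field number and 3-bit wire type:

```go
package main

import "fmt"

// decodeTag decodes a Protobuf record tag: a varint whose low 3 bits
// are the wire type and whose upper bits (up to 29) are the field number.
// Returns n = -1 if the varint is truncated.
func decodeTag(b []byte) (fieldNum uint32, wireType uint8, n int) {
	var v uint64
	for i, c := range b {
		v |= uint64(c&0x7f) << (7 * i)
		if c < 0x80 {
			return uint32(v >> 3), uint8(v & 7), i + 1
		}
	}
	return 0, 0, -1
}

func main() {
	// Field 1, wire type 2 (length-delimited): tag byte 0x0A.
	num, typ, n := decodeTag([]byte{0x0A})
	fmt.Println(num, typ, n) // 1 2 1
}
```

A real parser would follow the tag with either a varint value or a length-prefixed payload, depending on the wire type.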
my_proto.pb.go
// “Fast” implementation of parsing a particular message in Go.
func (m *MyProto) Unmarshal(b []byte) error {
	for len(b) > 0 {
		num, typ, n := protowire.ConsumeTag(b)
		if n < 0 {
			return protowire.ParseError(n)
		}
		b = b[n:]
		_ = typ // Wire type; each case below checks it.
		switch num {
		// One case per field.
		}
	}
	return nil
}
Branch predictor and
icache HATE this!
“Fast” Protobuf parsers are branchy
●Big switches are common in the “fastest” Protobuf parsers.
●But they have terrible codegen. In most cases, they need to be turned into one
branch instruction per case, essentially turning the parser into a “branch
maze” that confuses the branch predictor.
●This is also tons of assembly: if you parse many different message types with
lots of different fields, the instruction cache will clog, leading to stalls.
●If you’re lucky, the switch turns into a jump table, but conditions must be just
right (e.g. contiguous case values).
BUT WAIT!
An interpreter VM is just
a giant switch-case…
Anatomy of a threaded interpreter
The best non-jitting interpreters use computed goto and threaded code, which
basically looks something like this:
1.Loop over instruction stream.
2.Load opcode, index into table of function pointers.
3.Execute indirect branch to instruction impl, update interpreter state.
4.Repeat.
This allows us to use the branch predictor to emulate icache for our interpreter.
vm.go
func Exec(prog []byte) {
vm := VM{prog: prog}
for !vm.halt {
opcode := prog[vm.pc] // Load opcode.
vm.pc++
thunk := thunks[opcode] // Select instruction thunk.
vm = thunk(vm) // Update VM state.
}
}
Instruction Thunks
An instruction thunk can do anything:
■Perform arithmetic on VM state.
■Update vm.pc (i.e. a branch instruction).
■Crash the VM by setting vm.halt (invalid opcode).
Modern CPUs are fantastic at predicting indirect branches, especially when there
is a single very hot branch with lots of targets. For example, Zen 4 can track
~64 targets for such a branch without breaking a sweat.
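The thunk table described above can be sketched as follows, extending the vm.go loop from the earlier slide. The opcodes here (opHalt, opInc, opJmpZ) are inventions for illustration; hyperpb's real opcodes are derived from record tags:

```go
package main

import "fmt"

// VM state, as in the earlier vm.go sketch.
type VM struct {
	prog []byte
	pc   int
	acc  int64
	halt bool
}

// Hypothetical opcodes for illustration only.
const (
	opHalt = iota // stop the VM
	opInc         // acc++
	opJmpZ        // if acc == 0, jump to one-byte operand
)

// The thunk table: one function per opcode. Each thunk takes the VM
// state by value and returns the updated state.
var thunks = [...]func(VM) VM{
	opHalt: func(vm VM) VM { vm.halt = true; return vm },
	opInc:  func(vm VM) VM { vm.acc++; return vm },
	opJmpZ: func(vm VM) VM {
		target := int(vm.prog[vm.pc]) // read one-byte operand
		vm.pc++
		if vm.acc == 0 {
			vm.pc = target // a branch instruction: just update vm.pc
		}
		return vm
	},
}

func Exec(prog []byte) VM {
	vm := VM{prog: prog}
	for !vm.halt {
		opcode := vm.prog[vm.pc] // Load opcode.
		vm.pc++
		vm = thunks[opcode](vm) // Indirect branch to the thunk.
	}
	return vm
}

func main() {
	vm := Exec([]byte{opInc, opInc, opHalt})
	fmt.Println(vm.acc) // 2
}
```

The single indirect call site in Exec is the "single very hot branch with lots of targets" that the branch predictor handles so well.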
Table-Driven Parsing:
Record tags as instruction opcodes
hyperpb: a Protobuf VM
●hyperpb is a Go Protobuf parser based on a VM that interprets Protobuf as
bytecode.
●“TDP” was pioneered by Google’s UPB library, and is used in Protobuf C++.
●hyperpb innovates on the idea, leveraging Go’s quirks to build the fastest
dynamic Protobuf parser available.
<github.com/bufbuild/hyperpb-go>
using_hyperpb.go
// Compile a type for your message, similar to compiling a regex.
msgType := hyperpb.CompileMessageDescriptor(...)
// Allocate a fresh message using that type.
msg := hyperpb.NewMessage(msgType)
// Parse the message!
if err := proto.Unmarshal(myData, msg); err != nil {
// Handle parse failure.
}
// Do stuff with msg using reflection.
Anatomy of hyperpb
●hyperpb is like a regex library, requiring you to compile optimized parsers at
runtime. hyperpb’s optimizing compiler is slow, but it’s a one-time cost.
●The compiler determines an optimal memory layout for the message and generates
the VM’s opcode tables for parsing that message type.
●Unmarshalling calls into the VM, which looks just like the threaded, computed goto
VM we sketched earlier.
●Accessing fields requires Protobuf reflection, and the parsed value is immutable.
●All memory is manually managed by hyperpb using sophisticated arenas.
hyperpb is 2-3x faster than
Protobuf Go’s gencode
Throwing the compiler book at Protobuf
●hyperpb’s compiler is a genuine optimizing compiler, with multiple
intermediate representations and interprocedural (i.e., inter-message-type)
analyses.
●Fields are classified as “hot” or “cold”, influencing how memory is allocated
for them.
●Field representations are selected based on many inputs; there are over 200 field
representations to choose from, although only ~10 will be used by any given
workload.
●Linker resolves inter-message references at the end.
Influencing register allocation in the VM
●VM state consists of eight 64-bit words (split into vm.P1 and vm.P2 to work
around a Go compiler bug).
●All VM functions pass them as input and output: p1, p2 = f(p1, p2). VM
state is thus always kept in the first eight argument registers.
●Extreme measures are taken to ensure VM state is never spilled to the stack.
This is a significant latency reduction, by taking the core’s store queue out of
the equation.
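The shape of the trick can be sketched as below, assuming Go's register-based calling convention on amd64/arm64. The P1/P2 field names here are invented; the point is that two four-word structs passed and returned by value fit in the first eight argument registers, so state never has to round-trip through memory:

```go
package main

import "fmt"

// Two halves of the VM state, four 64-bit words each. Under Go's
// register ABI (amd64/arm64), both fit entirely in argument registers.
type P1 struct{ a, b, c, d uint64 }
type P2 struct{ e, f, g, h uint64 }

// Every VM function follows the same shape: p1, p2 = f(p1, p2).
// Mutations happen on register-resident copies; nothing is spilled
// to the stack as long as the compiler cooperates.
func step(p1 P1, p2 P2) (P1, P2) {
	p1.a++
	return p1, p2
}

func main() {
	var p1 P1
	var p2 P2
	for i := 0; i < 3; i++ {
		p1, p2 = step(p1, p2)
	}
	fmt.Println(p1.a) // 3
}
```

Keeping everything in registers sidesteps the store queue entirely, which is where the latency win comes from.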
Profile-guided optimization
●The VM can record profiles to feed back into the compiler.
●This allows the compiler to learn which fields are hot and how large to
pre-allocate buffers.
●Reusing arenas effectively “profiles” memory usage, amortizing trips into Go’s
general-purpose allocator down to zero.
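The arena idea behind that amortization can be sketched as a toy bump allocator. This is an illustration only, not hyperpb's implementation (which chains chunks and manages raw memory directly):

```go
package main

import "fmt"

// Arena is a toy bump allocator: allocation is just an offset bump,
// with no per-object bookkeeping for the GC.
type Arena struct {
	buf []byte
	off int
}

func NewArena(size int) *Arena { return &Arena{buf: make([]byte, size)} }

// Alloc hands out n bytes from the current chunk.
func (a *Arena) Alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		// A real arena chains a new chunk; this toy just starts a
		// bigger one. Earlier allocations stay valid because they
		// alias the old buffer, which the GC keeps alive.
		a.buf = make([]byte, 2*(len(a.buf)+n))
		a.off = 0
	}
	p := a.buf[a.off : a.off+n : a.off+n]
	a.off += n
	return p
}

// Reset recycles the arena for the next parse. Reusing a warm,
// right-sized arena is what amortizes allocator trips toward zero.
func (a *Arena) Reset() { a.off = 0 }

func main() {
	a := NewArena(64)
	p := a.Alloc(16)
	fmt.Println(len(p), a.off) // 16 16
	a.Reset()
	fmt.Println(a.off) // 0
}
```

After a few parses, the arena has grown to the workload's high-water mark, so subsequent parses allocate nothing new.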
Zero-copy parsing
●Many fields can be decoded without copying any bytes!
●String fields, repeated fixed-width fields, and repeated varint fields with small
values are all decoded without copying, resulting in absurd ~10 Gbps
throughputs on some benchmarks (e.g. large ML tensors).
●Zero-copy also enables pointer compression on string types, reducing overall
memory usage.
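The zero-copy decode for a length-delimited field can be sketched as returning a subslice of the input instead of copying. This toy (with a single-byte length prefix for simplicity) is only safe because the caller keeps the source buffer alive and never mutates it, which is one reason hyperpb's parsed messages are immutable:

```go
package main

import "fmt"

// readStringZeroCopy decodes a length-delimited field by aliasing the
// input buffer: no bytes are copied. Toy version: the length prefix is
// assumed to be a single varint byte (< 128).
func readStringZeroCopy(b []byte) (s, rest []byte, ok bool) {
	if len(b) == 0 {
		return nil, nil, false
	}
	n := int(b[0])
	if n >= 0x80 || len(b) < 1+n {
		return nil, nil, false
	}
	// Full-slice expression caps the result so appends can't
	// clobber bytes beyond the field.
	return b[1 : 1+n : 1+n], b[1+n:], true
}

func main() {
	buf := []byte{5, 'h', 'e', 'l', 'l', 'o', 0xFF}
	s, rest, ok := readStringZeroCopy(buf)
	fmt.Println(string(s), len(rest), ok) // hello 1 true
	// s aliases buf directly: no copy happened.
	fmt.Println(&s[0] == &buf[1]) // true
}
```

Because the decoded value is just an (offset, length) view into the source buffer, it can also be stored as a compressed pointer, which is where the memory savings on string types come from.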
Learn more!
Performance should be super accessible and welcoming. I’ve done my part to
make learning how hyperpb works easy and fun.
●Read the code! I’ve meticulously commented it all! :)
●Read my blogpost! It’s this session in more detail.
●Read about arenas! Goes into detail on memory management.
●Read Buf’s announcement! More content coming soon :D
Thank you! Let’s connect.
Miguel Young de la Sota [email protected]
@mcy.gay (Bluesky)
https://mcyoung.xyz