Parsing Protobuf as Fast as Possible by Miguel Young de la Sota
ScyllaDB
23 slides
Oct 14, 2025
About This Presentation
Protobuf is an extremely popular binary data interchange format. This session dives into hyperpb, a Protobuf parser for Go that uses every trick in the book, and some new ones, to achieve the highest throughput possible. We'll talk about table-driven parsing, the parser paradigm used by hyperpb, and hyperpb's innovations on the concept. We'll also take a look at the high-level architecture of hyperpb, including its compiler, parser VM, and layout optimizations.
Slide Content
A ScyllaDB Community
Parsing Protobuf as
Fast as Possible
Miguel Young de la Sota
Engineer
Miguel Young de la Sota
Engineer at Buf
■Designed Protobuf Editions and Protobuf Rust.
■Obsessed with perf at all costs in every language.
■I’ve ICEed every compiler I’ve ever touched…
■Also a terminally-online furry artist!
What’s a Protobuf?
●Somewhat dated (c. 1999) binary serialization format from Google, centered
around a tag-length-value record layer. Field numbers are 29-bit integers.
●Schema-aware ecosystem: types are defined in IDL files, which can be
compiled into efficient data structures + codecs in many languages.
●One Protobuf type (a message) compiles to one parser function (giant
switch-case on record tags).
●Fields can be accessed directly with accessors or generically with
Java/Go-like dynamic reflection.
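The tag-length-value layer described above can be sketched in a few lines of Go. This hand-rolled decoder (an illustration, not part of any Protobuf library) splits a varint record tag into its 29-bit field number and 3-bit wire type:

```go
package main

import "fmt"

// decodeTag decodes a Protobuf record tag: a varint whose low 3 bits
// are the wire type and whose upper bits (up to 29) are the field number.
// Returns n = -1 if the varint is truncated.
func decodeTag(b []byte) (fieldNum uint32, wireType uint8, n int) {
	var v uint64
	for i, c := range b {
		v |= uint64(c&0x7f) << (7 * i)
		if c < 0x80 {
			return uint32(v >> 3), uint8(v & 7), i + 1
		}
	}
	return 0, 0, -1
}

func main() {
	// Field 1, wire type 2 (length-delimited): tag byte 0x0A.
	num, typ, n := decodeTag([]byte{0x0A})
	fmt.Println(num, typ, n) // 1 2 1
}
```

A real parser would follow the tag with either a varint value or a length-prefixed payload, depending on the wire type.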
my_proto.pb.go
// “Fast” implementation of parsing a particular message in Go.
func (m *MyProto) Unmarshal(b []byte) error {
	for len(b) > 0 {
		num, typ, n := protowire.ConsumeTag(b)
		if n < 0 {
			return protowire.ParseError(n)
		}
		b = b[n:]
		_ = typ // Wire type; each case below checks it.
		switch num {
		// One case per field.
		}
	}
	return nil
}
Branch predictor and
icache HATE this!
“Fast” Protobuf parsers are branchy
●Big switches are common in the “fastest” Protobuf parsers.
●But they have terrible codegen. In most cases, they need to be turned into one
branch instruction per case, essentially turning the parser into a “branch
maze” that confuses the branch predictor.
●This is also tons of assembly: if you parse many different message types with
lots of different fields, the instruction cache will clog, leading to stalls.
●If you’re lucky, the switch turns into a jump table, but conditions must be just
right (e.g. contiguous case values).
BUT WAIT!
An interpreter VM is just
a giant switch-case…
Anatomy of a threaded interpreter
The best non-jitting interpreters use computed goto and threaded code, which
basically looks something like this:
1.Loop over instruction stream.
2.Load opcode, index into table of function pointers.
3.Execute indirect branch to instruction impl, update interpreter state.
4.Repeat.
This allows us to use the branch predictor to emulate icache for our interpreter.
vm.go
func Exec(prog []byte) {
vm := VM{prog: prog}
for !vm.halt {
opcode := prog[vm.pc] // Load opcode.
vm.pc++
thunk := thunks[opcode] // Select instruction thunk.
vm = thunk(vm) // Update VM state.
}
}
Instruction Thunks
An instruction thunk can do anything:
■Perform arithmetic on VM state.
■Update vm.pc (i.e. a branch instruction).
■Crash the VM by setting vm.halt (invalid opcode).
Modern CPUs are fantastic at predicting indirect branches, especially when there
is a single very hot branch with lots of targets. For example, Zen 4 can track
~64 targets for such a branch without breaking a sweat.
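The thunk table described above can be sketched as follows, extending the vm.go loop from the earlier slide. The opcodes here (opHalt, opInc, opJmpZ) are inventions for illustration; hyperpb's real opcodes are derived from record tags:

```go
package main

import "fmt"

// VM state, as in the earlier vm.go sketch.
type VM struct {
	prog []byte
	pc   int
	acc  int64
	halt bool
}

// Hypothetical opcodes for illustration only.
const (
	opHalt = iota // stop the VM
	opInc         // acc++
	opJmpZ        // if acc == 0, jump to one-byte operand
)

// The thunk table: one function per opcode. Each thunk takes the VM
// state by value and returns the updated state.
var thunks = [...]func(VM) VM{
	opHalt: func(vm VM) VM { vm.halt = true; return vm },
	opInc:  func(vm VM) VM { vm.acc++; return vm },
	opJmpZ: func(vm VM) VM {
		target := int(vm.prog[vm.pc]) // read one-byte operand
		vm.pc++
		if vm.acc == 0 {
			vm.pc = target // a branch instruction: just update vm.pc
		}
		return vm
	},
}

func Exec(prog []byte) VM {
	vm := VM{prog: prog}
	for !vm.halt {
		opcode := vm.prog[vm.pc] // Load opcode.
		vm.pc++
		vm = thunks[opcode](vm) // Indirect branch to the thunk.
	}
	return vm
}

func main() {
	vm := Exec([]byte{opInc, opInc, opHalt})
	fmt.Println(vm.acc) // 2
}
```

The single indirect call site in Exec is the "single very hot branch with lots of targets" that the branch predictor handles so well.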
Table-Driven Parsing:
Record tags as instruction opcodes
hyperpb: a Protobuf VM
●hyperpb is a Go Protobuf parser based on a VM that interprets Protobuf as
bytecode.
●“TDP” was pioneered by Google’s UPB library, and is used in Protobuf C++.
●hyperpb innovates on the idea, leveraging Go’s quirks to build the fastest
dynamic Protobuf parser available.
<github.com/bufbuild/hyperpb-go>
using_hyperpb.go
// Compile a type for your message, similar to compiling a regex.
msgType := hyperpb.CompileMessageDescriptor(...)
// Allocate a fresh message using that type.
msg := hyperpb.NewMessage(msgType)
// Parse the message!
if err := proto.Unmarshal(myData, msg); err != nil {
// Handle parse failure.
}
// Do stuff with msg using reflection.
Anatomy of hyperpb
●hyperpb is like a regex library, requiring you to compile optimized parsers at
runtime. hyperpb’s optimizing compiler is slow, but it’s a one-time cost.
●The compiler determines an optimal memory layout for the message and generates
the VM’s opcode tables for parsing that message type.
●Unmarshalling calls into the VM, which looks just like the threaded, computed goto
VM we sketched earlier.
●Accessing fields requires Protobuf reflection, and the parsed value is immutable.
●All memory is manually managed by hyperpb using sophisticated arenas.
hyperpb is 2-3x faster than
Protobuf Go’s gencode
Throwing the compiler book at Protobuf
●hyperpb’s compiler is a genuine optimizing compiler, with multiple
intermediate representations and interprocedural (i.e., inter-message-type)
analyses.
●Fields are classified as “hot” or “cold”, influencing how memory is allocated
for them.
●Field representations are selected based on many inputs; there are over 200 field
representations to choose from, although only ~10 will be used by any given
workload.
●Linker resolves inter-message references at the end.
Influencing register allocation in the VM
●VM state consists of eight 64-bit words (split into vm.P1 and vm.P2 to work
around a Go compiler bug).
●All VM functions pass them as input and output: p1, p2 = f(p1, p2). VM
state is thus always kept in the first eight argument registers.
●Extreme measures are taken to ensure VM state is never spilled to the stack.
This is a significant latency reduction, by taking the core’s store queue out of
the equation.
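The shape of the trick can be sketched as below, assuming Go's register-based calling convention on amd64/arm64. The P1/P2 field names here are invented; the point is that two four-word structs passed and returned by value fit in the first eight argument registers, so state never has to round-trip through memory:

```go
package main

import "fmt"

// Two halves of the VM state, four 64-bit words each. Under Go's
// register ABI (amd64/arm64), both fit entirely in argument registers.
type P1 struct{ a, b, c, d uint64 }
type P2 struct{ e, f, g, h uint64 }

// Every VM function follows the same shape: p1, p2 = f(p1, p2).
// Mutations happen on register-resident copies; nothing is spilled
// to the stack as long as the compiler cooperates.
func step(p1 P1, p2 P2) (P1, P2) {
	p1.a++
	return p1, p2
}

func main() {
	var p1 P1
	var p2 P2
	for i := 0; i < 3; i++ {
		p1, p2 = step(p1, p2)
	}
	fmt.Println(p1.a) // 3
}
```

Keeping everything in registers sidesteps the store queue entirely, which is where the latency win comes from.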
Profile-guided optimization
●The VM can record profiles to feed back into the compiler.
●This allows the compiler to learn which fields are hot and how large to
pre-allocate buffers.
●Reusing arenas effectively “profiles” memory usage, amortizing trips into Go’s
general-purpose allocator down to zero.
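The arena idea behind that amortization can be sketched as a toy bump allocator. This is an illustration only, not hyperpb's implementation (which chains chunks and manages raw memory directly):

```go
package main

import "fmt"

// Arena is a toy bump allocator: allocation is just an offset bump,
// with no per-object bookkeeping for the GC.
type Arena struct {
	buf []byte
	off int
}

func NewArena(size int) *Arena { return &Arena{buf: make([]byte, size)} }

// Alloc hands out n bytes from the current chunk.
func (a *Arena) Alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		// A real arena chains a new chunk; this toy just starts a
		// bigger one. Earlier allocations stay valid because they
		// alias the old buffer, which the GC keeps alive.
		a.buf = make([]byte, 2*(len(a.buf)+n))
		a.off = 0
	}
	p := a.buf[a.off : a.off+n : a.off+n]
	a.off += n
	return p
}

// Reset recycles the arena for the next parse. Reusing a warm,
// right-sized arena is what amortizes allocator trips toward zero.
func (a *Arena) Reset() { a.off = 0 }

func main() {
	a := NewArena(64)
	p := a.Alloc(16)
	fmt.Println(len(p), a.off) // 16 16
	a.Reset()
	fmt.Println(a.off) // 0
}
```

After a few parses, the arena has grown to the workload's high-water mark, so subsequent parses allocate nothing new.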
Zero-copy parsing
●Many fields can be decoded without copying any bytes!
●String fields, repeated fixed-width fields, and repeated varint fields with small
values are all decoded without copying, resulting in absurd ~10 Gbps
throughputs on some benchmarks (e.g. large ML tensors).
●Zero-copy also enables pointer compression on string types, reducing overall
memory usage.
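The zero-copy decode for a length-delimited field can be sketched as returning a subslice of the input instead of copying. This toy (with a single-byte length prefix for simplicity) is only safe because the caller keeps the source buffer alive and never mutates it, which is one reason hyperpb's parsed messages are immutable:

```go
package main

import "fmt"

// readStringZeroCopy decodes a length-delimited field by aliasing the
// input buffer: no bytes are copied. Toy version: the length prefix is
// assumed to be a single varint byte (< 128).
func readStringZeroCopy(b []byte) (s, rest []byte, ok bool) {
	if len(b) == 0 {
		return nil, nil, false
	}
	n := int(b[0])
	if n >= 0x80 || len(b) < 1+n {
		return nil, nil, false
	}
	// Full-slice expression caps the result so appends can't
	// clobber bytes beyond the field.
	return b[1 : 1+n : 1+n], b[1+n:], true
}

func main() {
	buf := []byte{5, 'h', 'e', 'l', 'l', 'o', 0xFF}
	s, rest, ok := readStringZeroCopy(buf)
	fmt.Println(string(s), len(rest), ok) // hello 1 true
	// s aliases buf directly: no copy happened.
	fmt.Println(&s[0] == &buf[1]) // true
}
```

Because the decoded value is just an (offset, length) view into the source buffer, it can also be stored as a compressed pointer, which is where the memory savings on string types come from.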
Learn more!
Performance should be super accessible and welcoming. I’ve done my part to
make learning how hyperpb works easy and fun.
●Read the code! I’ve meticulously commented it all! :)
●Read my blogpost! It’s this session in more detail.
●Read about arenas! Goes into detail on memory management.
●Read Buf’s announcement! More content coming soon :D
Thank you! Let’s connect.
Miguel Young de la Sota [email protected]
@mcy.gay (Bluesky)
https://mcyoung.xyz