Profile-Guided Optimization (PGO): (Ab)using it for Fun and Profit

ScyllaDB · 20 slides · Oct 11, 2024

About This Presentation

Discover how to boost your software with lesser-known compiler flags and Profile-Guided Optimization (PGO). Learn what PGO is, how it works, its nuances, and see real benchmarks. Ready to delegate optimization to your compiler and gain performance? Join us!


Slide Content

A ScyllaDB Community
Profile-Guided Optimization (PGO):
(Ab)using it for Fun and Profit
Alexander Zaitsev
CTO at Cytopus

Alexander Zaitsev (He/Him)

CTO at Cytopus
■Optimized dozens of projects with PGO
■All about performance - that’s why I like P99!
■Interested in compiler optimizations
■Hiking and playing videogames

PGO benchmarks in practice

Application   Improvement
PostgreSQL    up to +15% faster queries
SQLite        up to +20% faster queries
ClickHouse    up to +13% QPS
MySQL         up to +35% QPS
MariaDB       up to +23% TPS
MongoDB       up to 2x faster queries
Redis         up to +70% RPS
Clang         up to +20% compilation speed
GCC           up to +7-12% compilation speed
Envoy         up to +20% RPS
HAProxy       up to +5% RPS
Vector        +15% EPS
Rsyslog       +11% EPS
Fluent-Bit    +23% EPS
Chromium      up to +12% faster
Firefox       up to +12% faster
CPython       up to +13% speedup
Rustc         up to +15% compilation speed

Build optimization pipeline

Ahead-of-Time (AOT): source code (from VCS) -> compiler (in CI/CD) -> machine code, shipped to the target machine.

Just-in-Time (JIT): source code (from VCS) -> compiler (in CI/CD) -> bytecode, executed by a virtual machine on the target machine.

Compiler optimizations and runtime information
■Hot/cold code splitting
■Inlining
■Loop roll/unroll
■Link-Time Optimization (LTO)
■And many other funny things!

Many compiler optimizations can be improved
by providing runtime execution statistics!

The solution - Profile-Guided Optimization
■Collect runtime statistics on a target machine
■Pass them to the compiler
■Use the profile during the compilation phase

PGO kinds
■Instrumentation
■Sampling
■Different flavours and combinations (like CSIR PGO)
■Specific PGO types (like Temporal PGO)

How does Instrumentation PGO work?
■Compile your application in the Instrumentation mode
■Run the instrumented application on your typical workload
■Collect PGO profiles
■Recompile your application once again with the PGO profiles
■…
■Enjoy your PGO-optimized application!
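The steps above map onto concrete compiler invocations. A minimal sketch with Clang (file names, the profile directory, and the workload command are illustrative; GCC has analogous -fprofile-generate/-fprofile-use flags):

```shell
# 1. Build in instrumentation mode
clang -O2 -fprofile-generate=./pgo-profiles app.c -o app-instrumented

# 2. Run the instrumented binary on a representative workload;
#    each run writes a .profraw file into ./pgo-profiles
./app-instrumented --typical-workload

# 3. Merge the raw profiles into a single indexed profile
llvm-profdata merge -output=app.profdata ./pgo-profiles/*.profraw

# 4. Recompile with the profile applied
clang -O2 -fprofile-use=app.profdata app.c -o app-optimized
```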

Instrumentation PGO: caveats
■You need to compile your application at least twice: once for instrumentation and then again for the actual optimization
■An instrumented binary is larger
■An instrumented binary is slower

Binary size increase table

Application   Release size   Instrumented size   Ratio
ClickHouse    2.0 GiB        2.8 GiB             1.4x
MongoDB       151 MiB        255 MiB             1.69x
SQLite        1.8 MiB        2.7 MiB             1.5x
Nginx         3.8 MiB        4.3 MiB             1.13x
curl          1.1 MiB        1.4 MiB             1.27x
Vector        198 MiB        286 MiB             1.44x
HAProxy       13 MiB         17 MiB              1.3x

Binary slowdown table

Application   Slowdown (instrumented vs release)
ClickHouse    311x
Tarantool     1.5x
HAProxy       1.20x
Fluent-Bit    1.48x
Vector        14x
clang-tidy    2.28x
lld           6.8x

How does Sampling PGO work?
■Run your usual application
■Collect runtime information via an (external) profiler (like Linux perf or Intel
VTune)
■Recompile your application once again with runtime information
■…
■Enjoy your PGO-optimized application!
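A sketch of the sampling workflow with Linux perf and the LLVM toolchain (file names and the workload command are illustrative):

```shell
# 1. Build a normal release binary with debug line info,
#    so samples can be mapped back to source locations
clang -O2 -g app.c -o app

# 2. Profile a regular run; -b enables branch-stack sampling
perf record -b -o perf.data -- ./app --typical-workload

# 3. Convert the perf profile into a compiler-readable profile
#    (llvm-profgen requires branch-stack samples; AutoFDO's
#    create_llvm_prof is the alternative converter)
llvm-profgen --perfdata=perf.data --binary=./app --output=app.prof

# 4. Recompile with the sampled profile
clang -O2 -fprofile-sample-use=app.prof app.c -o app-optimized
```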

Sampling PGO: caveats
■Branch-stack sampling (BSS) hardware support (e.g. Intel LBR) leads to better results but can be unavailable on your hardware/OS
■Limited tooling support
●For C, C++, Rust, and Fortran the only options are Google AutoFDO (buggy) or llvm-profgen (works only with BSS)
●For the Go compiler it's pprof - works quite well

Instrumentation vs Sampling
■Instrumentation achieves better optimization
●According to Google: Sampling PGO has ~80-90% efficiency of Instrumentation PGO
■Sampling has far less runtime overhead
●~2% with Sampling compared to +inf with Instrumentation
●You can even tweak the amount of overhead via a sampling rate
■Instrumentation has better compiler, tooling and OS support

How should I collect PGO profiles?
■From unit tests - please don't!
■From benchmarks - it depends
■From a manually crafted training scenario - a good option
■From production - a great option (but please be careful)

Current PGO state across languages
■C, C++ - the most mature implementations in existence
■Rust, D, other GCC- or LLVM-based compilers - almost the same as C++, but without some of the most advanced PGO features
■Go (official compiler) - supported, but can do little (for now)
■C# - supported, called "Dynamic PGO"
■GraalVM targets (Java, Kotlin and others) - supported, but not enough publicly available information :(
■Other languages/compilers - with 0.999(9) probability PGO is not supported
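For Go specifically, the workflow is a sketch like the following (the pprof endpoint and package path are illustrative; it assumes net/http/pprof is enabled in the service):

```shell
# Collect a CPU profile from the running service
curl -o cpu.pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

# Option A: the toolchain automatically picks up a default.pgo
# file in the main package directory
cp cpu.pprof ./cmd/app/default.pgo
go build ./cmd/app

# Option B: pass the profile explicitly
go build -pgo=cpu.pprof ./cmd/app
```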

Continuous Profile-Guided Optimization
■AFAIK, the only existing solution is based on Google-Wide Profiling (GWP), and it's closed-source
■There is no ready-to-use open-source solution yet
■There is an idea about making such a platform as a part of Grafana
Pyroscope or Elasticsearch Universal Profiling

Do you want more advanced techniques? Yeah!
■Post-Link Optimization (PLO): LLVM BOLT, Propeller, Intel TLO
■Application-Specific Operating Systems (ASOS)
■Or even Application-Specific Interpreters (I did this too, huh)
■Machine-Learning based compilers
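As a taste of Post-Link Optimization, here is a sketch of the LLVM BOLT workflow (binary and workload names are illustrative; the binary must be linked with relocations preserved, e.g. -Wl,--emit-relocs):

```shell
# Sample the release (ideally already PGO-optimized) binary
# with branch stacks
perf record -e cycles:u -j any,u -o perf.data -- ./app --typical-workload

# Convert the perf data into a BOLT profile
perf2bolt -p perf.data -o app.fdata ./app

# Rewrite the binary post-link: reorder basic blocks and functions,
# split hot/cold code
llvm-bolt ./app -o app.bolt -data=app.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions
```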

Links
■Awesome PGO - https://github.com/zamazan4ik/awesome-pgo
■Clang documentation about PGO - link
■LLVM Discord (#profiling)

Thank you! Let’s connect.
Alexander Zaitsev
[email protected]
@zamazan4ik
https://zamazan4ik.github.io