Go Faster: Tuning the Go Runtime for Latency and Throughput by Paweł Obrępalski
ScyllaDB
About This Presentation
Most Go services don’t need runtime tuning...until they do. At ShareChat, running hundreds of Go services across thousands of cores, we’ve seen real gains from understanding the scheduler and garbage collector under production load. This talk covers when tuning is worth it, how to use runtime variables like GOGC, GOMEMLIMIT, and GOMAXPROCS, and how to apply Profile-Guided Optimisation. We’ll also share lightweight monitoring strategies and the performance improvements we achieved in production.
Slide Content
A ScyllaDB Community
Go Faster: Tuning the Go Runtime for Latency and Throughput
Paweł Obrępalski
Staff Engineer
Paweł Obrępalski (he/him)
Staff Engineer @ ShareChat
■Aerospace researcher turned software engineer
■Focused on large scale recommender systems
■Leading delivery team at ShareChat
■Enjoys biking, gym and sauna
What will we talk about today?
■Go Runtime
●Scheduler
●Garbage Collection
■Observability
●Metrics
●Profiles
■Runtime Tuning
■Our Results
The Runtime: Your Silent Partner
■Benefits
●Effortless concurrency - can manage millions of goroutines
●Automatic memory management
●Cross-platform compatibility
■Costs
●~2MB increase in binary size
●Additional startup latency
●Garbage Collection overhead (usually 1-3% CPU)
■The default behaviour is sensible for most workloads
●Check your code before reaching for runtime optimisations
●Can tune the behaviour by changing environment variables: GOMAXPROCS, GOGC, and GOMEMLIMIT (see the sketch below)
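These knobs can also be set from code instead of the environment; a minimal sketch using only the standard library (the values are illustrative, not recommendations):

import (
	"runtime"
	"runtime/debug"
)

func init() {
	runtime.GOMAXPROCS(4)         // same effect as GOMAXPROCS=4
	debug.SetGCPercent(200)       // same effect as GOGC=200
	debug.SetMemoryLimit(3 << 30) // same effect as GOMEMLIMIT=3GiB (soft limit, in bytes)
}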
Multiplexing At Scale
■G-M-P model
●G: Goroutines - lightweight threads (2KB stack initially)
●M: OS threads - created as needed, reused later
●P: Processors - fixed by GOMAXPROCS
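A tiny sketch for inspecting these counts from a running program (standard library only):

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("CPUs visible to the runtime:", runtime.NumCPU())
	fmt.Println("P count (GOMAXPROCS):", runtime.GOMAXPROCS(0)) // 0 = query without changing
	fmt.Println("Live goroutines (G):", runtime.NumGoroutine())
}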
Run Queues
■New goroutine:
●Put on the local run queue (max 256 entries)
●If full: move half to global
■Empty local?
●Get from global
●Steal from others
■Blocking (e.g. I/O)?
●P switches to a different M
●G goes back to a run queue
■Sharing CPU time?
●Long-running goroutines are preempted after ~10ms
Garbage Collection
■Problem: Find which objects are not in use
■How to avoid halting the entire application?
●Concurrent Mark & Sweep
i. Mark all of the active objects
ii. Brief stop-the-world to create write barriers
iii. Sweep (delete) all non-active objects
■GC runs alongside application
■Multiple workers in both stages
■Can tune the behaviour using GOGC/GOMEMLIMIT
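A small sketch for watching a collection with runtime.MemStats; the forced runtime.GC() call is only for the demo, in production the pacer decides when to run:

import (
	"fmt"
	"runtime"
)

func main() {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	runtime.GC() // force a full collection for illustration
	runtime.ReadMemStats(&after)
	fmt.Printf("GC cycles: %d -> %d\n", before.NumGC, after.NumGC)
	fmt.Printf("Live heap: %d -> %d bytes\n", before.HeapAlloc, after.HeapAlloc)
}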
Observability
Know When to Optimize
■You can’t optimize what you can’t see!
■Areas to cover:
●Application (rps, latency)
●Runtime (GC, goroutines, heap)
●System (CPU, memory, network)
■Key runtime metrics:
●/gc/cycles/total:gc-cycles - Collection frequency
●/memory/classes/heap/objects:bytes - Live heap
●/sched/latencies:seconds - Scheduling delays
import (
	"net/http"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	http.Handle("/metrics", promhttp.Handler()) // exposes Go runtime collectors by default
	// your application code
	http.ListenAndServe(":2112", nil)
}
■Exposing your application metrics is just a few lines of code away
●Out of the box: go_gc, go_sched, go_memstats, go_threads, and much more…
■Can easily add custom ones (e.g. P50/P99 latency)
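The runtime metric names listed above can also be read directly with the standard runtime/metrics package, without any exporter; a minimal sketch:

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/gc/cycles/total:gc-cycles"},
		{Name: "/memory/classes/heap/objects:bytes"},
		{Name: "/sched/latencies:seconds"},
	}
	metrics.Read(samples)
	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
		case metrics.KindFloat64Histogram:
			h := s.Value.Float64Histogram()
			fmt.Printf("%s = histogram with %d buckets\n", s.Name, len(h.Buckets))
		}
	}
}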
Profiling: Finding the Answers
■Run with Web UI
●go tool pprof -http=:90 localhost:80/debug/pprof/profile
●Will open UI on localhost:90 with data from your application running at port 80
■Several types:
●CPU (/profile)
●Memory (/heap)
●Allocations (/allocs)
●Mutex (/mutex)
●Goroutines (/goroutine)
■Exposing profiles on canary instances provides a quick way to observe actual usage
■Ideally you want to collect the profiles from different releases (continuous profiling)
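Exposing the endpoints is a single blank import; a sketch (the port is an illustrative choice, and in production access should be restricted, e.g. to canary instances):

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Profiles become available at http://localhost:6060/debug/pprof/
	log.Fatal(http.ListenAndServe(":6060", nil))
}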
Tips
■Keep objects local, avoid pointers when possible
■Check for heap escapes with go build -gcflags="-m"
■Consider object pooling (see the sketch below)
■Sawtooth memory usage -> excessive allocations
■Identify problems using heap profiles
■Run with GODEBUG=gctrace=1 to expose GC details
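As an example of object pooling, a hedged sketch with sync.Pool (the buffer type and render function are made up for illustration):

import (
	"bytes"
	"sync"
)

// bufPool reuses buffers between calls to cut per-request allocations.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(payload []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	buf.Write(payload)
	return buf.String()
}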
Runtime Tuning
GOMAXPROCS for containers
■Setup:
●Container: 2 CPU limit
●Host: 8 CPUs
●GOMAXPROCS: 8 (defaults to the host CPU count)
●Result: 8 Ps contend for a 2-CPU quota, causing throttling and scheduling latency
■Solution:
●Specify manually
●Use automaxprocs
●Upgrade to Go 1.25!
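Before Go 1.25, the common automatic fix was uber-go's automaxprocs package; a minimal sketch:

import (
	_ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the container CPU quota at startup
)

func main() {
	// With a 2-CPU limit, the scheduler now uses 2 Ps instead of the host's 8.
}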
GOGC - Control GC Frequency
■Target heap memory = Live heap * (1 + GOGC/100)
■Higher GOGC value:
●GC runs less often
●Lower CPU usage
●Higher memory usage
Impact of different GOGC values - 50/100/200
Source: https://tip.golang.org/doc/gc-guide
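Worked example: with a 100 MB live heap, GOGC=50 triggers the next collection around 150 MB, GOGC=100 around 200 MB, and GOGC=200 around 300 MB, which is the CPU-for-memory trade-off shown in the figure.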
GOMEMLIMIT - Avoid OOMs
■How does it work?
●Increases GC frequency as the heap approaches the configured value
●Soft limit - Go does not guarantee staying under it
●Overrides GOGC when necessary
■Use when you have full control over execution environment (e.g. containers)
■Good starting point ~90% of available memory
■Pair high GOGC with GOMEMLIMIT
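For example, in a container with a 4 GiB memory limit, GOMEMLIMIT=3686MiB (roughly 90%) paired with a high GOGC lets the heap use most of the container's memory, with the GC ramping up only as the limit approaches.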
PGO: Let Production Guide Compilation
■Free performance! No code changes required
●Up to ~14% depending on the workload
●Biggest gains for compute-bound workloads
■How does it work?
●Analyze your app's CPU usage
●Inline hot functions more aggressively
■How do I use it?
●Collect profiles (e.g. curl <your_service>/debug/pprof/profile > cpu.pprof)
●Build with PGO: go build -pgo=cpu.pprof
●Deploy and measure results!
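Note that since Go 1.21 the build defaults to -pgo=auto, which picks up a profile named default.pgo in the main package's directory, so committing the collected profile under that name removes the need for an explicit flag.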
Our results
■GOMAXPROCS
●Setting to # of available cores is a good starting point
●This usually gives the best throughput
●With higher values we’ve observed lower P50 but higher P99 latencies
●Observed up to 30% reduction in cost after tuning this parameter
■Experiences with PGO?
●Easy to start with, especially if you already gather profiles from production
●Mixed results - most of our services were I/O bound and did not benefit much
●Longer build times - should be fixed with Go 1.22+
Our results
■GOGC
●In extreme cases GC took over 40% CPU time
■Review heap profiles for leaks/inefficiencies and tune GOGC
●CPU usage with different GOGC values on one of our biggest services (20k+ peak QPS):
■100: 40% CPU
■200: 21.5% CPU, 72% increase in peak memory usage
■300: 15.9% CPU, 364% increase in peak memory usage
■500: 5% CPU, 780% increase in peak memory usage
■Tuning GOGC/GOMEMLIMIT
● Average ~5% reduction in CPU usage, ~5% reduction in P99 latency
●We’ve found GOMEMLIMIT at ~90% and high GOGC to be suitable for most workloads
●Increasing memory may be a good trade-off for cost (1 CPU core costs roughly as much as 4-5 GB of RAM)
Stay Current, Stay Fast
■Incremental improvements over the versions
■1.21: PGO (Profile Guided Optimisation)
■1.22: Improvements to runtime decreasing CPU overhead by 1-3%
■1.24
●Improvements to runtime decreasing CPU overhead by 2-3%
●New map implementation based on Swiss tables
■1.25
●Container-aware GOMAXPROCS!
●Experimental garbage collector (10-40% reduction in overhead in some workloads)
Key Takeaways
■Optimisation is a continuous, iterative process
■Observability comes first - You can’t optimise what you can’t see
■Go has great performance out of the box
■Tuning the runtime may provide additional benefits
●Especially important at scale
●Easy to do with GOMAXPROCS, GOGC, and GOMEMLIMIT
■Stay up to date with latest Go versions for free performance
Thank you! Let’s connect.
Paweł Obrępalski [email protected]
linkedin.com/in/obrepalski/
obrepalski.com