Go Faster: Tuning the Go Runtime for Latency and Throughput by Paweł Obrępalski

ScyllaDB · 25 slides · Oct 15, 2025

About This Presentation

Most Go services don’t need runtime tuning...until they do. At ShareChat, running hundreds of Go services across thousands of cores, we’ve seen real gains from understanding the scheduler and garbage collector under production load. This talk covers when tuning is worth it, how to use runtime va...


Slide Content

A ScyllaDB Community
Go Faster: Tuning The Go Runtime For Latency And Throughput
Paweł Obrępalski
Staff Engineer

Paweł Obrępalski (he/him)

Staff Engineer @ ShareChat
■Aerospace researcher turned software engineer
■Focused on large scale recommender systems
■Leading delivery team at ShareChat
■Enjoys biking, gym and sauna

What will we talk about today?
■Go Runtime
●Scheduler
●Garbage Collection
■Observability
●Metrics
●Profiles
■Runtime Tuning
■Our Results

The Runtime: Your Silent Partner

■Benefits
●Effortless concurrency - can manage millions of goroutines
●Automatic memory management
●Cross-platform compatibility
■Costs
●~2MB increase in binary size
●Additional startup latency
●Garbage Collection overhead (usually 1-3% CPU)
■The default behaviour is sensible for most workloads
●Check your code before runtime optimisations
●Can tune the behaviour by changing environment variables: GOMAXPROCS, GOGC, and GOMEMLIMIT

Multiplexing At Scale
■G-M-P model
●G: Goroutines - lightweight threads (2KB stack initially)
●M: OS threads - created as needed, reused later
●P: Processors - fixed by GOMAXPROCS
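A minimal sketch of checking the G and P counts from inside a program (the runtime exposes no direct M count; pprof's threadcreate profile covers that):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // P: logical processors the scheduler will use (defaults to the CPU count).
    fmt.Println("GOMAXPROCS (P):", runtime.GOMAXPROCS(0)) // passing 0 queries without changing it
    // G: goroutines currently alive.
    fmt.Println("Goroutines (G):", runtime.NumGoroutine())
    fmt.Println("CPUs:", runtime.NumCPU())
}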

Run Queues
■New goroutine:
●Put on local (max 256)
●If full: move half to global
■Empty local?
●Get from global
●Steal from others
■Blocking (e.g. I/O)?
●P switches to different M
●G back to Queue
■Sharing?
●Switch every 10ms
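■Observing the queues on a live process: GODEBUG=schedtrace=1000 ./yourapp (binary name illustrative) prints a line every second with gomaxprocs, idle Ps, thread counts, the global run queue length, and each P's local run queue length; adding scheddetail=1 includes per-goroutine state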

Garbage Collection
■Problem: Find which objects are not in use
■How to avoid halting the entire application?
●Concurrent Mark & Sweep
i. Mark all of the active objects
ii. Brief stop-the-world to enable write barriers
iii. Sweep (delete) all non-active objects
■GC runs alongside application
■Multiple workers in both stages
■Can tune the behaviour using GOGC/GOMEMLIMIT
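A small sketch of watching the collector and setting both knobs from code (values are illustrative; debug.SetMemoryLimit needs Go 1.19+):

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func main() {
    // Programmatic equivalents of the environment variables.
    debug.SetGCPercent(200)       // same effect as GOGC=200
    debug.SetMemoryLimit(4 << 30) // same effect as GOMEMLIMIT=4GiB (value is in bytes)

    var ms runtime.MemStats
    runtime.ReadMemStats(&ms) // briefly stops the world; fine for debugging, avoid in hot paths
    fmt.Printf("GC cycles: %d, live heap: %d MiB, total pause: %d ns\n",
        ms.NumGC, ms.HeapAlloc>>20, ms.PauseTotalNs)
}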

Observability

Know When to Optimize
■You can’t optimize what you can’t see!
■Areas to cover:
●Application (rps, latency)
●Runtime (GC, goroutines, heap)
●System (CPU, memory, network)
■Key runtime metrics:
●/gc/cycles/total:gc-cycles - Collection frequency
●/memory/classes/heap/objects:bytes - Live heap
●/sched/latencies:seconds - Scheduling delays
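These names come from the standard runtime/metrics package; a minimal sketch of reading them directly:

package main

import (
    "fmt"
    "runtime/metrics"
)

func main() {
    samples := []metrics.Sample{
        {Name: "/gc/cycles/total:gc-cycles"},
        {Name: "/memory/classes/heap/objects:bytes"},
        {Name: "/sched/latencies:seconds"}, // histogram of scheduling delays
    }
    metrics.Read(samples)

    for _, s := range samples {
        switch s.Value.Kind() {
        case metrics.KindUint64:
            fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
        case metrics.KindFloat64Histogram:
            h := s.Value.Float64Histogram()
            fmt.Printf("%s: histogram with %d buckets\n", s.Name, len(h.Counts))
        }
    }
}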

Quick Start
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    http.Handle("/metrics", promhttp.Handler())
    // your application code
    http.ListenAndServe(":2112", nil)
}
■Exposing your application metrics is just a few lines of code away
●Out of the box: go_gc, go_sched, go_memstats, go_threads, and much more…
■Can easily add custom ones (e.g. P50/P99 latency)
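A sketch of one way to wire up such a custom latency metric with client_golang (metric name and route are illustrative); P50/P99 are then derived from the histogram buckets on the Prometheus side:

package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestDuration records request latency; quantiles (P50/P99) are computed from the buckets.
var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency in seconds.",
        Buckets: prometheus.DefBuckets,
    },
    []string{"path"},
)

// instrument wraps a handler and observes how long each request took.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next(w, r)
        requestDuration.WithLabelValues(path).Observe(time.Since(start).Seconds())
    }
}

func main() {
    prometheus.MustRegister(requestDuration)
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/hello", instrument("/hello", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))
    http.ListenAndServe(":2112", nil)
}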

Profiling: Finding the Answers
■Run with Web UI
●go tool pprof -http=:90 localhost:80/debug/pprof/profile
●Will open UI on localhost:90 with data from your application running at port 80
■Several types:
●CPU (/profile)
●Memory (/heap)
●Allocations (/allocs)
●Mutex (/mutex)
●Goroutines (/goroutine)
■Exposing profiles on canary instances provides a quick way to observe actual usage
■Ideally you want to collect the profiles from different releases (continuous profiling)
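The endpoints listed above come from the standard net/http/pprof package; exposing them is a one-line import (port 80 matches the example above):

package main

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // CPU, heap, allocs, mutex and goroutine profiles are now served under /debug/pprof/.
    // Mutex (and block) profiles also need runtime.SetMutexProfileFraction / runtime.SetBlockProfileRate
    // to be enabled, otherwise they stay empty.
    http.ListenAndServe(":80", nil)
}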

Flamegraph - CPU
■Memory allocations: runtime.(*mheap)
■Garbage collection: runtime.gcBgMarkWorker / runtime.bgsweep
■Scheduler: runtime.schedule
■Your code: runtime.main

Flamegraph - Heap
■Identify where allocations happen
■Quickly find memory leaks

Tips
■Keep objects local, avoid pointers when possible
■Check escapes with go build -gcflags="-m"
■Consider object pooling (see the sketch after this list)
■Sawtooth memory usage -> excessive allocations
■Identify problems using heap profiles
■Run with GODEBUG=gctrace=1 to expose GC details
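A minimal object-pooling sketch with sync.Pool (the buffer-reuse scenario is illustrative):

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// bufPool reuses buffers instead of allocating a new one per call, reducing GC pressure
// on allocation-heavy paths.
var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func render(name string) string {
    buf := bufPool.Get().(*bytes.Buffer)
    defer func() {
        buf.Reset() // reset before returning the buffer to the pool
        bufPool.Put(buf)
    }()
    fmt.Fprintf(buf, "hello, %s", name)
    return buf.String()
}

func main() {
    fmt.Println(render("gopher"))
}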

Runtime Tuning

GOMAXPROCS for containers
■Setup:
●Container: 2 CPU limit
●Host: 8 CPUs
●GOMAXPROCS: 8 (defaults to the host CPU count, so 8 runtime threads contend for a 2-CPU quota and get throttled)
■Solution:
●Specify manually
●Use automaxprocs
●Upgrade to Go 1.25!
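A sketch of the pre-1.25 options (automaxprocs is uber-go's go.uber.org/automaxprocs module):

package main

import (
    "fmt"
    "runtime"

    // Adjusts GOMAXPROCS to the container's CPU quota at init time.
    _ "go.uber.org/automaxprocs"
)

func main() {
    // Manual alternative: set GOMAXPROCS=2 in the environment, or runtime.GOMAXPROCS(2) in code.
    // On Go 1.25+ the runtime reads the cgroup CPU limit itself, so neither is needed.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}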

GOGC - Control GC Frequency
■Target heap memory = Live heap * (1 + GOGC/100)
■Higher GOGC value:
●GC runs less often
●Lower CPU usage
●Higher memory usage
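■Example: with a 100 MiB live heap, GOGC=100 triggers the next cycle at ~200 MiB, GOGC=200 at ~300 MiB, and GOGC=50 at ~150 MiB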

Impact of different GOGC values - 50/100/200
Source: https://tip.golang.org/doc/gc-guide

GOMEMLIMIT - Avoid OOMs
■How does it work?
●Increases GC frequency as the heap approaches the configured value
●Soft limit - Go does not guarantee staying under it
●Overrides GOGC when necessary
■Use when you have full control over the execution environment (e.g. containers)
■Good starting point: ~90% of available memory
■Pair high GOGC with GOMEMLIMIT
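■Example (values illustrative): a container with a 3 GiB limit could run with GOMEMLIMIT=2750MiB GOGC=200 ./yourapp, or GOGC=off to let the limit alone drive collection; the same knobs are available in code via debug.SetMemoryLimit and debug.SetGCPercent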

PGO: Let Production Guide Compilation
■Free performance! No code changes required
●Up to ~14% depending on the workload
●Biggest gains for compute-bound workloads
■How does it work?
●Analyzes your app's CPU usage
●Inlines hot functions more aggressively
■How do I use it?
●Collect profiles (e.g. curl <your_service> > cpu.pprof)
●Build with PGO: go build -pgo=cpu.pprof
●Deploy and measure results!
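■Since Go 1.21, -pgo=auto is the default, so committing the profile as default.pgo in the main package's directory lets a plain go build pick it up automatically; a typical collect-and-build loop (host and package path are illustrative):
curl -o cpu.pprof "http://<your_service>/debug/pprof/profile?seconds=30"
go build ./cmd/yourapp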

Our results
■GOMAXPROCS
●Setting to # of available cores is a good starting point
●This usually gives the best throughput
●With higher values we've observed a reduction in P50 but higher P99 latencies
●Observed up to 30% reduction in cost after tuning this parameter
■Experiences with PGO?
●Easy to start with, especially if you already gather profiles from production
●Mixed results - most of our services were I/O bound and did not benefit much
●Longer build times - should be fixed with Go 1.22+

Our results
■GOGC
●In extreme cases GC took over 40% of CPU time
●Review heap profiles for leaks/inefficiencies and tune GOGC
●CPU usage at different GOGC values from one of our biggest services (20k+ peak QPS):
■100: 40% CPU
■200: 21.5% CPU, 72% peak memory usage increase
■300: 15.9% CPU, 364% peak memory usage increase
■500: 5% CPU, 780% peak memory usage increase
■Tuning GOGC/GOMEMLIMIT
●Average ~5% reduction in CPU usage, ~5% reduction in P99 latency
●We’ve found GOMEMLIMIT at ~90% and high GOGC to be suitable for most workloads
●Increasing memory may be a good trade-off for cost (1 CPU core costs roughly as much as 4-5 GB of RAM)

Stay Current, Stay Fast

■Incremental improvements across versions
■1.21: PGO (Profile Guided Optimisation)
■1.22: Improvements to runtime decreasing CPU overhead by 1-3%
■1.24
●Improvements to runtime decreasing CPU overhead by 2-3%
●New map implementation based on Swiss tables
■1.25
●Container-aware GOMAXPROCS!
●Experimental garbage collector (10-40% reduction in overhead in some workloads)

Key Takeaways
■Optimisation is a continuous, iterative process
■Observability comes first - You can’t optimise what you can’t see
■Go has great performance out of the box
■Tuning the runtime may provide additional benefits
●Especially important at scale
●Easy to do with GOMAXPROCS, GOGC, and GOMEMLIMIT
■Stay up to date with the latest Go versions for free performance

Thank you! Let’s connect.
Paweł Obrępalski
[email protected]
linkedin.com/in/obrepalski/
obrepalski.com