Go Faster: Tuning the Go Runtime for Latency and Throughput by Paweł Obrępalski

ScyllaDB · 25 slides · Oct 15, 2025

About This Presentation

Most Go services don’t need runtime tuning...until they do. At ShareChat, running hundreds of Go services across thousands of cores, we’ve seen real gains from understanding the scheduler and garbage collector under production load. This talk covers when tuning is worth it, how to use runtime va...


Slide Content

A ScyllaDB Community
Go Faster: Tuning The Go Runtime For Latency And Throughput
Paweł Obrępalski
Staff Engineer

Paweł Obrępalski (he/him)

Staff Engineer @ ShareChat
■Aerospace researcher turned software engineer
■Focused on large scale recommender systems
■Leading delivery team at ShareChat
■Enjoys biking, gym and sauna

What will we talk about today?
■Go Runtime
●Scheduler
●Garbage Collection
■Observability
●Metrics
●Profiles
■Runtime Tuning
■Our Results

The Runtime: Your Silent Partner

■Benefits
●Effortless concurrency - can manage millions of goroutines
●Automatic memory management
●Cross-platform compatibility
■Costs
●~2MB increase in binary size
●Additional startup latency
●Garbage Collection overhead (usually 1-3% CPU)
■The default behaviour is sensible for most workloads
●Check your code before runtime optimisations
●Can tune the behaviour by changing environment variables: GOMAXPROCS, GOGC, and GOMEMLIMIT

Multiplexing At Scale
■G-M-P model
●G: Goroutines - lightweight threads (2KB stack initially)
●M: OS threads - created as needed, reused later
●P: Processors - fixed by GOMAXPROCS
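A minimal sketch of checking the G and P counts from inside a program (the runtime exposes no direct M count; pprof's threadcreate profile covers that):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // P: logical processors the scheduler will use (defaults to the CPU count).
    fmt.Println("GOMAXPROCS (P):", runtime.GOMAXPROCS(0)) // passing 0 queries without changing it
    // G: goroutines currently alive.
    fmt.Println("Goroutines (G):", runtime.NumGoroutine())
    fmt.Println("CPUs:", runtime.NumCPU())
}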

Run Queues
■New goroutine:
●Put on local (max 256)
●If full: move half to global
■Empty local?
●Get from global
●Steal from others
■Blocking (e.g. I/O)?
●P switches to different M
●G back to Queue
■Sharing?
●Switch every 10ms
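■Observing the queues on a live process: GODEBUG=schedtrace=1000 ./yourapp (binary name illustrative) prints a line every second with gomaxprocs, idle Ps, thread counts, the global run queue length, and each P's local run queue length; adding scheddetail=1 includes per-goroutine state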

Garbage Collection
■Problem: Find which objects are not in use
■How to avoid halting the entire application?
●Concurrent Mark & Sweep
i. Mark all of the active objects
ii. Brief stop-the-world to enable write barriers
iii. Sweep (delete) all non-active objects
■GC runs alongside application
■Multiple workers in both stages
■Can tune the behaviour using GOGC/GOMEMLIMIT
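A small sketch of watching the collector and setting both knobs from code (values are illustrative; debug.SetMemoryLimit needs Go 1.19+):

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func main() {
    // Programmatic equivalents of the environment variables.
    debug.SetGCPercent(200)       // same effect as GOGC=200
    debug.SetMemoryLimit(4 << 30) // same effect as GOMEMLIMIT=4GiB (value is in bytes)

    var ms runtime.MemStats
    runtime.ReadMemStats(&ms) // briefly stops the world; fine for debugging, avoid in hot paths
    fmt.Printf("GC cycles: %d, live heap: %d MiB, total pause: %d ns\n",
        ms.NumGC, ms.HeapAlloc>>20, ms.PauseTotalNs)
}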

Observability

Know When to Optimize
■You can’t optimize what you can’t see!
■Areas to cover:
●Application (rps, latency)
●Runtime (GC, goroutines, heap)
●System (CPU, memory, network)
■Key runtime metrics:
●/gc/cycles/total:gc-cycles - Collection frequency
●/memory/classes/heap/objects:bytes - Live heap
●/sched/latencies:seconds - Scheduling delays
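These names come from the standard runtime/metrics package; a minimal sketch of reading them directly:

package main

import (
    "fmt"
    "runtime/metrics"
)

func main() {
    samples := []metrics.Sample{
        {Name: "/gc/cycles/total:gc-cycles"},
        {Name: "/memory/classes/heap/objects:bytes"},
        {Name: "/sched/latencies:seconds"}, // histogram of scheduling delays
    }
    metrics.Read(samples)

    for _, s := range samples {
        switch s.Value.Kind() {
        case metrics.KindUint64:
            fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
        case metrics.KindFloat64Histogram:
            h := s.Value.Float64Histogram()
            fmt.Printf("%s: histogram with %d buckets\n", s.Name, len(h.Counts))
        }
    }
}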

Quick Start
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    http.Handle("/metrics", promhttp.Handler())
    // your application code
    http.ListenAndServe(":2112", nil)
}
■Exposing your application metrics is just a few lines of code away
●Out of the box: go_gc, go_sched, go_memstats, go_threads, and much more…
■Can easily add custom ones (e.g. P50/P99 latency)
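A sketch of one way to wire up such a custom latency metric with client_golang (metric name and route are illustrative); P50/P99 are then derived from the histogram buckets on the Prometheus side:

package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestDuration records request latency; quantiles (P50/P99) are computed from the buckets.
var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency in seconds.",
        Buckets: prometheus.DefBuckets,
    },
    []string{"path"},
)

// instrument wraps a handler and observes how long each request took.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next(w, r)
        requestDuration.WithLabelValues(path).Observe(time.Since(start).Seconds())
    }
}

func main() {
    prometheus.MustRegister(requestDuration)
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/hello", instrument("/hello", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))
    http.ListenAndServe(":2112", nil)
}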

Profiling: Finding the Answers
■Run with Web UI
●go tool pprof -http=:90 localhost:80/debug/pprof/profile
●Will open UI on localhost:90 with data from your application running at port 80
■Several types:
●CPU (/profile)
●Memory (/heap)
●Allocations (/allocs)
●Mutex (/mutex)
●Goroutines (/goroutine)
■Exposing profiles on canary instances provides a quick way to observe actual usage
■Ideally you want to collect the profiles from different releases (continuous profiling)
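The endpoints listed above come from the standard net/http/pprof package; exposing them is a one-line import (port 80 matches the example above):

package main

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // CPU, heap, allocs, mutex and goroutine profiles are now served under /debug/pprof/.
    // Mutex (and block) profiles also need runtime.SetMutexProfileFraction / runtime.SetBlockProfileRate
    // to be enabled, otherwise they stay empty.
    http.ListenAndServe(":80", nil)
}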

Flamegraph - CPU
■Memory allocations: runtime.(*mheap)
■Garbage collection: runtime.gcBgMarkWorker / runtime.bgsweep
■Scheduler: runtime.schedule
■Your code: runtime.main

Flamegraph - Heap
■Identify where allocations happen
■Quickly find memory leaks

Tips
■Keep objects local, avoid pointers when possible
■Check escapes with go build -gcflags="-m"
■Consider object pooling (see the sketch after this list)
■Sawtooth memory usage -> excessive allocations
■Identify problems using heap profiles
■Run with GODEBUG=gctrace=1 to expose GC details
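A minimal object-pooling sketch with sync.Pool (the buffer-reuse scenario is illustrative):

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// bufPool reuses buffers instead of allocating a new one per call, reducing GC pressure
// on allocation-heavy paths.
var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func render(name string) string {
    buf := bufPool.Get().(*bytes.Buffer)
    defer func() {
        buf.Reset() // reset before returning the buffer to the pool
        bufPool.Put(buf)
    }()
    fmt.Fprintf(buf, "hello, %s", name)
    return buf.String()
}

func main() {
    fmt.Println(render("gopher"))
}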

Runtime Tuning

GOMAXPROCS for containers
■Setup:
●Container: 2 CPU limit
●Host: 8 CPUs
●GOMAXPROCS: 8 (defaults to the host CPU count, so 8 runtime threads contend for a 2-CPU quota and get throttled)
■Solution:
●Specify manually
●Use automaxprocs
●Upgrade to Go 1.25!
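A sketch of the pre-1.25 options (automaxprocs is uber-go's go.uber.org/automaxprocs module):

package main

import (
    "fmt"
    "runtime"

    // Adjusts GOMAXPROCS to the container's CPU quota at init time.
    _ "go.uber.org/automaxprocs"
)

func main() {
    // Manual alternative: set GOMAXPROCS=2 in the environment, or runtime.GOMAXPROCS(2) in code.
    // On Go 1.25+ the runtime reads the cgroup CPU limit itself, so neither is needed.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}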

GOGC - Control GC Frequency
■Target heap memory = Live heap * (1 + GOGC/100)
■Higher GOGC value:
●GC runs less often
●Lower CPU usage
●Higher memory usage
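■Example: with a 100 MiB live heap, GOGC=100 triggers the next cycle at ~200 MiB, GOGC=200 at ~300 MiB, and GOGC=50 at ~150 MiB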

Impact of different GOGC values - 50/100/200
Source: https://tip.golang.org/doc/gc-guide

GOMEMLIMIT - Avoid OOMs
■How does it work?
●Increases GC frequency as the heap approaches the configured value
●Soft limit - Go does not guarantee staying under it
●Overrides GOGC when necessary
■Use when you have full control over the execution environment (e.g. containers)
■Good starting point: ~90% of available memory
■Pair high GOGC with GOMEMLIMIT
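■Example (values illustrative): a container with a 3 GiB limit could run with GOMEMLIMIT=2750MiB GOGC=200 ./yourapp, or GOGC=off to let the limit alone drive collection; the same knobs are available in code via debug.SetMemoryLimit and debug.SetGCPercent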

PGO: Let Production Guide Compilation
■Free performance! No code changes required
●Up to ~14% depending on the workload
●Biggest gains for compute-bound workloads
■How does it work?
●Analyzes your app's CPU usage
●Inlines hot functions more aggressively
■How do I use it?
●Collect profiles (e.g. curl <your_service> > cpu.pprof)
●Build with PGO: go build -pgo=cpu.pprof
●Deploy and measure results!
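■Since Go 1.21, -pgo=auto is the default, so committing the profile as default.pgo in the main package's directory lets a plain go build pick it up automatically; a typical collect-and-build loop (host and package path are illustrative):
curl -o cpu.pprof "http://<your_service>/debug/pprof/profile?seconds=30"
go build ./cmd/yourapp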

Our results
■GOMAXPROCS
●Setting to # of available cores is a good starting point
●This usually gives the best throughput
●With higher values we've observed a reduction in P50 but higher P99 latencies
●Observed up to 30% reduction in cost after tuning this parameter
■Experiences with PGO?
●Easy to start with, especially if you already gather profiles from production
●Mixed results - most of our services were I/O bound and did not benefit much
●Longer build times - should be fixed with Go 1.22+

Our results
■GOGC
●In extreme cases GC took over 40% of CPU time
●Review heap profiles for leaks/inefficiencies and tune GOGC
●CPU usage at different GOGC values from one of our biggest services (20k+ peak QPS):
■100: 40% CPU
■200: 21.5% CPU, 72% peak memory usage increase
■300: 15.9% CPU, 364% peak memory usage increase
■500: 5% CPU, 780% peak memory usage increase
■Tuning GOGC/GOMEMLIMIT
●Average ~5% reduction in CPU usage, ~5% reduction in P99 latency
●We’ve found GOMEMLIMIT at ~90% and high GOGC to be suitable for most workloads
●Increasing memory may be a good trade-off for cost (1 CPU core costs roughly as much as 4-5 GB of RAM)

Stay Current, Stay Fast

■Incremental improvements across versions
■1.21: PGO (Profile Guided Optimisation)
■1.22: Improvements to runtime decreasing CPU overhead by 1-3%
■1.24
●Improvements to runtime decreasing CPU overhead by 2-3%
●New map implementation based on Swiss tables
■1.25
●Container-aware GOMAXPROCS!
●Experimental garbage collector (10-40% reduction in overhead in some workloads)

Key Takeaways
■Optimisation is a continuous, iterative process
■Observability comes first - You can’t optimise what you can’t see
■Go has great performance out of the box
■Tuning the runtime may provide additional benefits
●Especially important at scale
●Easy to do with GOMAXPROCS, GOGC, and GOMEMLIMIT
■Stay up to date with the latest Go versions for free performance

Thank you! Let’s connect.
Paweł Obrępalski
[email protected]
linkedin.com/in/obrepalski/
obrepalski.com