Gmetrics: Processing Metrics at Uber Scale by Cristian Velazquez
About This Presentation
Gmetrics is the new library that Uber uses to process metrics; it addresses many of the pain points around processing metrics at Uber scale. Cristian provides an inside look at the migration, which involved rewriting everything for Java and Go. He also shares how they are doing this migration safely and transparently for users.
Slide Content
A ScyllaDB Community
Gmetrics: processing metrics at
Uber scale
Cristian Velazquez
Staff Software Engineer
Cristian Velazquez
■Efficiency:
■Garbage collection tuning for Java™ and Go™ services, which has saved the company >$10M
■Distributed load testing
■Metrics emission, disk-based cache solutions, latency tuning
■When I am not at work I enjoy spending time with my family and playing video games
Agenda
■What is Tally?
■What is Gmetrics?
■Highlights
■What didn't work as expected?
■Blockers
■Safe rollout
■Impact
■Next steps
What is Tally?
Open-source library for emitting metrics, developed by Uber in 2015 (see the sketch after this list).
■It supports M3 and Prometheus reporters.
■It supports different metric types:
■Counters, gauges, timers and histograms.
■We have code using a library older than Tally.
■We used to use statsd.
■Created wrappers for the old library to use Tally underneath.
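For context, here is a minimal Tally usage sketch in Go, based on the open-source github.com/uber-go/tally API (the reporter here is a stand-in; a real service would wire up an M3 or Prometheus reporter):

package main

import (
	"time"

	"github.com/uber-go/tally"
)

func main() {
	// The root scope flushes metrics to the configured reporter every second.
	// NullStatsReporter stands in for a real M3 or Prometheus reporter.
	scope, closer := tally.NewRootScope(tally.ScopeOptions{
		Prefix:   "my_service",
		Reporter: tally.NullStatsReporter,
	}, time.Second)
	defer closer.Close()

	// Tags are passed as a map; sub-scopes are tracked in a central registry.
	tagged := scope.Tagged(map[string]string{"region": "dca"})
	tagged.Counter("requests").Inc(1)
	tagged.Timer("latency").Record(5 * time.Millisecond)
}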
Gmetrics
What is Gmetrics?
New library for metrics that is a drop-in replacement for Tally. We have Java™ and Go™ versions, but today we are only analyzing Go™.
Highlights:
■Registryless.
■Reduced Garbage Collection overhead.
■Reduced lock contention.
■Reduced thrift encoding.
■Reduced CPU when emitting metrics.
Registryless
[Diagrams comparing Tally's registry-based design with Gmetrics' registryless design]
Reduced Garbage Collection overhead
■Gmetrics uses a single async thread approach to process metrics.
■The async thread is capable of doing its own memory management.
■So if you have 10 tags, that means 20 objects that need to be tracked, right?
■No, Gmetrics creates the canonical form in thrift bytes, so those 20 objects become a single buffer.
■Additionally, it creates buffers of 32KB, so it can pack multiple metrics together (see the sketch below).
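A conceptual sketch of this design, not Gmetrics' actual code: a single background goroutine owns the buffers, and updates arrive as already-encoded bytes that get packed into 32KB chunks. All names below are illustrative.

package metrics

const bufSize = 32 * 1024 // pack multiple encoded metrics into one 32KB buffer

// update carries a metric already encoded in its canonical (thrift-like) form,
// so N tag strings collapse into a single byte slice.
type update struct {
	encoded []byte
}

type emitter struct {
	updates chan update
	buf     []byte // owned exclusively by the async goroutine
}

func newEmitter() *emitter {
	e := &emitter{
		updates: make(chan update, 1024),
		buf:     make([]byte, 0, bufSize),
	}
	go e.loop() // single async thread: no locks, private memory management
	return e
}

func (e *emitter) loop() {
	for u := range e.updates {
		if len(e.buf)+len(u.encoded) > bufSize {
			e.flush()
		}
		e.buf = append(e.buf, u.encoded...)
	}
}

func (e *emitter) flush() {
	// hand e.buf to the reporter/transport, then reuse the backing array
	e.buf = e.buf[:0]
}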
Reduced lock contention
■Tally needs locks to synchronize its registry.
■Most of the time it uses a read lock:
■Iterating the registry when emitting metrics.
■Getting a metric already seen.
■Creating a new metric requires a write lock.
Imagine:
■A 100k-metric registry emission and suddenly we need a write lock?
■Or an expensive write lock (too many tags).
■Gmetrics instead uses a per-proc buffer of pointers (illustrative sketch below).
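To make the contention pattern concrete, here is an illustrative Tally-style registry guarded by a sync.RWMutex (not the actual Tally code): lookups of known metrics take the read lock, while a first-seen metric needs the exclusive write lock that can stall a large emission pass.

package metrics

import "sync"

type registry struct {
	mu       sync.RWMutex
	counters map[string]*int64 // keyed by canonical name+tags
}

func newRegistry() *registry {
	return &registry{counters: make(map[string]*int64)}
}

// Lookup of an already-seen metric: read lock, cheap and concurrent.
func (r *registry) get(key string) (*int64, bool) {
	r.mu.RLock()
	c, ok := r.counters[key]
	r.mu.RUnlock()
	return c, ok
}

// First time a metric is seen: exclusive write lock. If this happens while an
// emission pass is iterating a 100k-entry registry under the read lock, the
// writer (and every reader behind it) has to wait.
func (r *registry) getOrCreate(key string) *int64 {
	if c, ok := r.get(key); ok {
		return c
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if c, ok := r.counters[key]; ok {
		return c
	}
	c := new(int64)
	r.counters[key] = c
	return c
}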
Reduced thrift encoding
■Tally maps a metric to an atomic variable.
■Then it maps this to a Tally object.
■Then it encodes it to thrift bytes.
■Gmetrics maps a metric to a normal variable.
■The mapping already includes >90% of the thrift encoding because we pre-generate it the first time we see a metric.
■Only the metric value is missing (see the sketch below).
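A hedged sketch of the idea: serialize the name and canonical tags once, when the metric is first seen, and finish each emission by appending only the value. The byte layout below is illustrative, not the real M3 thrift schema.

package metrics

import "encoding/binary"

// preEncoded holds the metric's name and tags serialized once, when the
// metric is first seen. Only the value changes afterwards.
type preEncoded struct {
	prefix []byte // name + canonical tags, encoded up front (>90% of the work)
}

// emit finishes the encoding by appending just the current value.
func (p *preEncoded) emit(dst []byte, value int64) []byte {
	dst = append(dst, p.prefix...)
	var v [8]byte
	binary.BigEndian.PutUint64(v[:], uint64(value))
	return append(dst, v[:]...)
}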
Reduced CPU when emitting metrics
Tally has to go through every single metric in its cache to see if it has changed (illustrative sketch after the tables below).

            Counters   Timers   Gauges   Histograms
Scope1      100        150      90       100
Scope2      50         170      120      100
Scope3      200        300      180      100

            -inf-1ms       1-2ms          2-3ms          ...
Histogram1  Has changed?   Has changed?   Has changed?   Has changed?
Histogram2  Has changed?   Has changed?   Has changed?   Has changed?
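An illustrative sketch of that per-interval scan (not Tally's actual code): every cached entry is checked for changes, whether or not it was touched since the last report.

package metrics

import "sync/atomic"

type cachedCounter struct {
	value        int64 // updated by the application (atomically)
	lastReported int64 // only touched by the reporting loop
}

// reportAll scans every cached metric each interval, asking "has it changed?"
// even for metrics that were never touched. With histograms, every bucket is
// a separate entry, multiplying the work.
func reportAll(counters map[string]*cachedCounter, emit func(name string, delta int64)) {
	for name, c := range counters {
		cur := atomic.LoadInt64(&c.value)
		if cur != c.lastReported {
			emit(name, cur-c.lastReported)
			c.lastReported = cur
		}
	}
}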
What didn't work as expected? (tagging cache)
■Gmetrics uses 2-level caching:
■First level maps incoming unordered tags to a canonical representation of the tags.
■Second level maps the canonical version to the final metric (counter, gauge, etc).
■The API uses a map for tagging, so completely random ordering is expected.
■A metric with >8 tags could generate >300k combinations (9 tags already have 9! = 362,880 possible orderings).
■Fix?
■Added insertion sort to tags (sketch below).
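A hedged sketch of the fix: sort the tag pairs by key before building the first-level cache key, so every ordering of the same tag set collapses to one entry. Insertion sort is cheap for the typical handful of tags.

package metrics

type tagPair struct {
	key, value string
}

// canonicalize sorts tag pairs in place by key using insertion sort, so the
// same tag set always maps to the same first-level cache entry regardless of
// the order the map was iterated in.
func canonicalize(tags []tagPair) {
	for i := 1; i < len(tags); i++ {
		t := tags[i]
		j := i - 1
		for j >= 0 && tags[j].key > t.key {
			tags[j+1] = tags[j]
			j--
		}
		tags[j+1] = t
	}
}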
What didn't work as expected? (histogram)
■Histograms are counters with 2 additional tags: bucket and bucketid.
■A histogram has an array of buckets that specify the ranges of the emitted values.
■Generating these 2 tags is expensive because you need to convert a number to a string.
■Fix?
■Put the index of the bucket in the buffer as raw bytes.
■Generate the 2 tags only when generating the canonical thrift bytes (sketch below).
■A histogram gets updated many times but it is inserted only once between metric emissions.
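A hedged sketch of the idea, with illustrative names and formats: the hot path records only the raw bucket index, and the bucket/bucketid strings are produced once, when the canonical thrift bytes are generated.

package metrics

import (
	"encoding/binary"
	"strconv"
)

// recordBucket appends just the bucket index to the update buffer: no
// number-to-string conversion on the hot path.
func recordBucket(buf []byte, bucketIdx uint32) []byte {
	var b [4]byte
	binary.LittleEndian.PutUint32(b[:], bucketIdx)
	return append(buf, b[:]...)
}

// bucketTags runs only when the canonical thrift bytes are generated, i.e.
// once per histogram/bucket between emissions, not on every update.
// The tag format here is illustrative.
func bucketTags(lower, upper float64, idx int) (bucket, bucketID string) {
	bucket = strconv.FormatFloat(lower, 'g', -1, 64) + "-" +
		strconv.FormatFloat(upper, 'g', -1, 64)
	return bucket, strconv.Itoa(idx)
}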
Blockers
Remember the old library that we were using before Tally?
■Gmetrics is registryless, so the result of tagging always returns a new memory address.
■That created a memory leak: the wrapper cached scopes keyed by tally.Scope, and every new address added a new entry.
■Fix: do not use a cache and just return a new wrapper every time (sketch after the code).
// The old wrapper cached scopes keyed by the tally.Scope value itself.
// With a registryless backend, every Tagged call yields a new scope value,
// so this map only grows and leaks memory.
type OldLibrary struct {
	CachedScopes map[tally.Scope]*OldLibraryScope
}
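A hedged sketch of the fix, building on the struct above (OldLibraryScope's fields are assumed for illustration): drop the cache and return a fresh, short-lived wrapper on every call.

package oldlib

import "github.com/uber-go/tally"

// OldLibraryScope's real fields are not shown in the deck; a single wrapped
// scope is assumed here for illustration.
type OldLibraryScope struct {
	scope tally.Scope
}

// Scope no longer consults CachedScopes: with a registryless backend every
// tally.Scope is a new value, so a cache keyed by scope grows forever.
// A fresh, short-lived wrapper per call is cheap and GC-friendly instead.
func (l *OldLibrary) Scope(s tally.Scope) *OldLibraryScope {
	return &OldLibraryScope{scope: s}
}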
Safe rollout
How can we safely roll this out?
■Uber has a configuration system called Flipr.
■During program initialization, check Flipr to select Tally or Gmetrics (sketch below).
■Flipr provides different conditions like: region, zone, environment, tier, etc.
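Flipr is Uber-internal, so the lookup below is abstracted behind a callback rather than a real client call; the sketch only shows the shape of the init-time switch between the two libraries.

package metricsinit

// Scope is the minimal common surface both Tally and Gmetrics expose
// (Gmetrics is a drop-in replacement, so either can back it).
type Scope interface {
	Counter(name string) Counter
}

type Counter interface {
	Inc(delta int64)
}

// NewScope picks the implementation once, during program initialization.
// The useGmetrics callback abstracts the Flipr lookup, which can condition
// on region, zone, environment, tier, and so on.
func NewScope(service string, useGmetrics func(service string) bool,
	newTally, newGmetrics func(service string) Scope) Scope {
	if useGmetrics(service) {
		return newGmetrics(service)
	}
	return newTally(service)
}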
Impact
Next steps
■Provide new APIs that:
■Avoid map[string]string for tags, because creating maps and iterating them is expensive.
■Allow pooling to reduce allocations and garbage collection even further (speculative sketch after the interface below).
// Current Scope API: tags arrive as a map that must be allocated by the
// caller and iterated by the library on every Tagged call.
type Scope interface {
	Tagged(tags map[string]string) Scope
	Counter(name string) Counter
}
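The deck does not show the new APIs, so the sketch below is speculative: variadic key/value pairs are one common way to avoid map[string]string, and sync.Pool is one way to support pooling. All names are illustrative.

package metrics

import "sync"

// Tag replaces map[string]string entries so the caller never has to allocate
// a map and the library never has to iterate one.
type Tag struct {
	Key, Value string
}

// TagScope is a speculative shape for the map-free API.
type TagScope interface {
	Tagged(tags ...Tag) TagScope
	Counter(name string) Counter
}

type Counter interface {
	Inc(delta int64)
}

// tagPool supports the "allow pooling" point: callers reuse tag slices
// across requests instead of allocating new ones each time.
var tagPool = sync.Pool{
	New: func() interface{} {
		s := make([]Tag, 0, 8)
		return &s
	},
}

func getTags() *[]Tag {
	s := tagPool.Get().(*[]Tag)
	*s = (*s)[:0]
	return s
}

func putTags(s *[]Tag) { tagPool.Put(s) }

A call would then look like scope.Tagged(Tag{"region", "dca"}).Counter("requests").Inc(1), with no map allocation or iteration on either side.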