Translations at Scale: Memory Optimization Techniques That Kept Uber's P99 Under 1ms by Cristian Velazquez

ScyllaDB · 20 slides · Oct 08, 2025

About This Presentation

Discover how Uber reimagined its translation service by shifting from a purely in-memory approach to a hybrid memory-disk architecture. Learn how this transformation reduced memory usage by 95% during initial load, maintained sub-millisecond P99 performance, and achieved a 99.5% cache hit ratio.


Slide Content

A ScyllaDB Community
Translations at Scale: Memory Optimization Techniques That Kept Uber's P99 Under 1ms
Cristian Velazquez
Sr. Staff Software Engineer

Cristian Velazquez

Sr. Staff Software Engineer at Uber
■Efficiency:
●Garbage collection tuning for Java and Go services, which has saved the company more than $10M.
●Distributed load testing.
●Metrics emission, disk-based cache solutions, latency tuning.
■When I am not at work, I enjoy spending time with my family and playing video games.

What was the initial problem?

High GC on some services
■Let's look at one example:


What options do we have?
■Give more memory to the service? Expensive.
■Optimize object allocation.

Heap profile
After running a heap profile in the service, we observed high object allocation from our translations library.
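
As a minimal sketch (our illustration, not the slides' setup), a standard way to make a heap profile available in a Go service is net/http/pprof:

// A minimal sketch, assuming a plain Go service: expose pprof so a heap
// profile can be captured from the running process.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// The heap profile can then be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("localhost:6060", nil))
}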

Let's load the same translations locally
8M+ objects created.

Let's give some context
Translations at Uber are:
■Stored in "repositories" (collections of translations).
■Uploaded to a blob storage service.
■Downloaded by every container, which serves the translations from memory.
■Re-downloaded and reloaded every 5 hours to keep them fresh.
■Updated atomically, so there is a window of a couple of seconds (up to a
minute) where we have 2 repositories loaded.

The services having issues had repositories with more than 2M permutations
(each key translated to 50+ languages). The culprit:
■sync.Map

Let's analyze the issue
// Simplified from Go's sync package: each key/value pair costs several
// heap objects, because keys and values are boxed into `any` and values
// sit behind an extra *entry indirection.
type Map struct {
	read  atomic.Pointer[readOnly]
	dirty map[any]*entry
}

type readOnly struct {
	m       map[any]*entry
	amended bool
}

type entry struct {
	p atomic.Pointer[any]
}


For the purposes of today's discussion, imagine that map[string]string would be enough: no additional pointers.

A single entry needs (a small allocation-counting demo follows this list):
■Key:
●1 object for the `any` box.
■Value:
●1 object for the *entry.
●1 object for the `any` the atomic.Pointer points to.
●1 object for the boxed value itself.
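
To make the accounting concrete, here is a rough sketch (our illustration, not from the slides) that counts allocations when filling a sync.Map versus a plain map; exact numbers vary by Go version:

// Compares per-entry allocation overhead of sync.Map vs map[string]string.
package main

import (
	"fmt"
	"sync"
	"testing"
)

func main() {
	keys := make([]string, 10000)
	vals := make([]string, 10000)
	for i := range keys {
		keys[i] = fmt.Sprintf("key-%d", i)
		vals[i] = fmt.Sprintf("value-%d", i)
	}

	syncMapAllocs := testing.AllocsPerRun(1, func() {
		var m sync.Map
		for i := range keys {
			m.Store(keys[i], vals[i]) // boxes key and value into `any`, allocates *entry
		}
	})
	plainMapAllocs := testing.AllocsPerRun(1, func() {
		m := make(map[string]string, len(keys))
		for i := range keys {
			m[keys[i]] = vals[i] // no boxing, no indirection
		}
	})
	fmt.Printf("sync.Map: %.0f allocs, map[string]string: %.0f allocs\n",
		syncMapAllocs, plainMapAllocs)
}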

How did we fix it?
■A simple map[string]Translation guarded by an RWMutex (a minimal sketch follows).
●Concurrent writes are extremely rare.
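
A minimal sketch of this fix, assuming illustrative names (Store, Translation) rather than Uber's actual code:

package translations

import "sync"

type Translation struct {
	Text string
}

// Store is a plain map behind an RWMutex: reads take the cheap read
// lock; the rare writers (the periodic reload) take the write lock.
type Store struct {
	mu sync.RWMutex
	m  map[string]Translation
}

func NewStore() *Store {
	return &Store{m: make(map[string]Translation)}
}

func (s *Store) Get(key string) (Translation, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	t, ok := s.m[key]
	return t, ok
}

// Replace atomically swaps in a freshly loaded repository; since writes
// are extremely rare, write-lock contention is not a concern.
func (s *Store) Replace(m map[string]Translation) {
	s.mu.Lock()
	s.m = m
	s.mu.Unlock()
}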
Impact:
■Local app:

■Service CPU before/after:
●Yellow is -7d (the same metric shifted back 7 days).
●Lower is better.
●~20% improvement.

So are we done?

Remaining issues
■OOMs during reloads.
●Reloads are atomic, so we end up with 2 copies for a few seconds.
■Still some GC issues in services with low memory.

Potential solutions
■Many services only need a small subset of a repository. So create a copy with only the subset?
●Costly to maintain.
■External translation service?
●Costly (we do millions of translations per second).
●Latency.
●Error handling.
■Disk-based solution?
●We decided to try this one.

Disk based solution

Hybrid solution
The key design features are:
■Custom schema model to store on disk (near-zero cost to serialize/deserialize).
■All keys' hash codes in memory (5 bytes gives a <0.01% collision rate).
■256 static shards to reduce lock contention when loading keys.
●Each shard maps to a different file.
■Each value item contains (see the encoding sketch after this list):
●Offset in the file (3 bytes -> 16MB).
●Compressed offsets (offset >> 3 -> 128MB).
●Index into the array of loaded entries (2 bytes).
■LRU for evicting keys.
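
An illustrative sketch of the compact entry layout described above (field names and alignment choice are our assumptions, not Uber's actual schema): a 5-byte truncated hash, a 3-byte offset stored as offset >> 3 (extending the addressable range from 16MB to 128MB at 8-byte granularity), and a 2-byte index into the loaded-entries array.

package diskcache

// entry is the per-key in-memory record: 10 bytes of payload per key.
type entry struct {
	hash   [5]byte // truncated key hash; 5 bytes keeps collisions <0.01%
	offset [3]byte // file offset >> 3; 24 bits * 8-byte granularity = 128MB
	index  uint16  // slot in the array of entries currently loaded in memory
}

// packOffset compresses a byte offset; values must be 8-byte aligned.
func packOffset(off int64) [3]byte {
	v := off >> 3
	return [3]byte{byte(v), byte(v >> 8), byte(v >> 16)}
}

// unpackOffset restores the original byte offset.
func unpackOffset(b [3]byte) int64 {
	return (int64(b[0]) | int64(b[1])<<8 | int64(b[2])<<16) << 3
}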

Let's look at the previous example
■2M+ translations used to need ~800MB to load.
●Reloads meant 1.6GB.
■The new approach takes only 40MB to load all keys (a rough sanity check follows this list).
●We allow only ~50k keys (configurable) in memory, which is 40-80MB.
●Reloads now mean 80-120MB + 40MB for the new cache's keys.
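
As a rough sanity check (our arithmetic, not stated in the slides), 40MB over roughly 2M keys works out to about 20 bytes per key, plausible for a ~10-byte packed entry plus map and allocator overhead:

\[
\frac{40\ \mathrm{MB}}{2 \times 10^{6}\ \mathrm{keys}} \approx 20\ \mathrm{bytes/key}
\]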

Challenges
■Reloads impact hit ratio.
●Store the hash codes of all the current items in memory and use them to populate the new
in-memory cache.
■Per-shard vs. global LRU.
●Global is expensive due to lock contention.
●Per-shard is not accurate (the most frequent items can land in just a few shards).
●We went with global, but only update an entry if its last update was more than a second
ago.
■High CPU usage from fetching the Unix time (see the sketch after this list).
●Background thread that updates our local time (an int64 variable).
■Reloads don't need all the previous items.
●With time information, we now load only the last 15 minutes of translations, which improves
memory usage and cache eviction after reloading.
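
A minimal sketch of the background-time trick (an assumed implementation, not Uber's code): one goroutine refreshes an atomic int64 so hot paths read the clock without calling time.Now() on every request.

package coarsetime

import (
	"sync/atomic"
	"time"
)

var nowUnixNano atomic.Int64

func init() {
	nowUnixNano.Store(time.Now().UnixNano())
	go func() {
		// A 10ms tick gives plenty of precision for LRU bookkeeping.
		for t := range time.Tick(10 * time.Millisecond) {
			nowUnixNano.Store(t.UnixNano())
		}
	}()
}

// Now returns the cached time: a single atomic load on the hot path.
func Now() int64 {
	return nowUnixNano.Load()
}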

Impact

But what about latency?

Thank you! Let's connect.
Cristian Velazquez
[email protected]
linkedin.com/in/cdvr1993

Related blogs:
■GOGCTuner blog
■Ballast blog
■Shadower blog
■Presto GC tuning