Translations at Scale: Memory Optimization Techniques That Kept Uber's P99 Under 1ms by Cristian Velazquez

ScyllaDB · 20 slides · Oct 08, 2025

About This Presentation

Discover how Uber reimagined its translation service by shifting from a purely in-memory approach to a hybrid memory-disk architecture. Learn how this transformation reduced memory usage by 95% during initial load, maintained sub-millisecond P99 performance, and achieved a 99.5% cache hit ratio.


Slide Content

A ScyllaDB Community
Translations at Scale: Memory Optimization Techniques That Kept Uber's P99 Under 1ms
Cristian Velazquez
Sr. Staff Software Engineer

Cristian Velazquez

Sr. Staff Software Engineer at Uber
■Efficiency:
●Garbage collection tuning for Java and Go services, which has saved the company more than $10M.
●Distributed load testing.
●Metrics emission, disk-based cache solutions, latency tuning.
■When I am not at work, I enjoy spending time with my family and playing video games.

What was the initial problem?

High GC on some services
■Let's look at one example:


What options do we have?
■Give more memory to the service? Expensive.
■Optimize object allocation.

Heap profile
After running a heap profile in the service, we observed high object allocation from our translations library.
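
As a minimal sketch (our illustration, not the slides' setup), a standard way to make a heap profile available in a Go service is net/http/pprof:

// A minimal sketch, assuming a plain Go service: expose pprof so a heap
// profile can be captured from the running process.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// The heap profile can then be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("localhost:6060", nil))
}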

Let's load the same translations locally
8M+ objects created.

Let's give some context
Translations at Uber are:
■Stored in "repositories" (collections of translations).
■Uploaded to a blob storage service.
■Downloaded by every container, which serves the translations from memory.
■Re-downloaded and reloaded every 5 hours to keep them fresh.
■Updated atomically, so there is a window of a couple of seconds (up to a
minute) where we have 2 repositories loaded.

The services having issues had repositories with more than 2M permutations
(each key translated to 50+ languages). The culprit:
■sync.Map

Let's analyze the issue
// Simplified from Go's sync package: each key/value pair costs several
// heap objects, because keys and values are boxed into `any` and values
// sit behind an extra *entry indirection.
type Map struct {
	read  atomic.Pointer[readOnly]
	dirty map[any]*entry
}

type readOnly struct {
	m       map[any]*entry
	amended bool
}

type entry struct {
	p atomic.Pointer[any]
}


For the purposes of today's discussion, imagine that map[string]string would be enough: no additional pointers.

A single entry needs (a small allocation-counting demo follows this list):
■Key:
●1 object for the `any` box.
■Value:
●1 object for the *entry.
●1 object for the `any` the atomic.Pointer points to.
●1 object for the boxed value itself.
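
To make the accounting concrete, here is a rough sketch (our illustration, not from the slides) that counts allocations when filling a sync.Map versus a plain map; exact numbers vary by Go version:

// Compares per-entry allocation overhead of sync.Map vs map[string]string.
package main

import (
	"fmt"
	"sync"
	"testing"
)

func main() {
	keys := make([]string, 10000)
	vals := make([]string, 10000)
	for i := range keys {
		keys[i] = fmt.Sprintf("key-%d", i)
		vals[i] = fmt.Sprintf("value-%d", i)
	}

	syncMapAllocs := testing.AllocsPerRun(1, func() {
		var m sync.Map
		for i := range keys {
			m.Store(keys[i], vals[i]) // boxes key and value into `any`, allocates *entry
		}
	})
	plainMapAllocs := testing.AllocsPerRun(1, func() {
		m := make(map[string]string, len(keys))
		for i := range keys {
			m[keys[i]] = vals[i] // no boxing, no indirection
		}
	})
	fmt.Printf("sync.Map: %.0f allocs, map[string]string: %.0f allocs\n",
		syncMapAllocs, plainMapAllocs)
}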

How did we fix it?
■A simple map[string]Translation guarded by an RWMutex (a minimal sketch follows).
●Concurrent writes are extremely rare.
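
A minimal sketch of this fix, assuming illustrative names (Store, Translation) rather than Uber's actual code:

package translations

import "sync"

type Translation struct {
	Text string
}

// Store is a plain map behind an RWMutex: reads take the cheap read
// lock; the rare writers (the periodic reload) take the write lock.
type Store struct {
	mu sync.RWMutex
	m  map[string]Translation
}

func NewStore() *Store {
	return &Store{m: make(map[string]Translation)}
}

func (s *Store) Get(key string) (Translation, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	t, ok := s.m[key]
	return t, ok
}

// Replace atomically swaps in a freshly loaded repository; since writes
// are extremely rare, write-lock contention is not a concern.
func (s *Store) Replace(m map[string]Translation) {
	s.mu.Lock()
	s.m = m
	s.mu.Unlock()
}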
Impact:
■Local app:

■Service CPU before/after:
●Yellow is -7d (the same metric shifted back 7 days).
●Lower is better.
●~20% improvement.

So are we done?

Remaining issues
■OOMs during reloads.
●Reloads are atomic, so we end up with 2 copies for a few seconds.
■Still some GC issues in services with low memory.

Potential solutions
■Many services only need a small subset of a repository. So create a copy with only the subset?
●Costly to maintain.
■External translation service?
●Costly (we do millions of translations per second).
●Latency.
●Error handling.
■Disk-based solution?
●We decided to try this one.

Disk based solution

Hybrid solution
The key design features are:
■Custom schema model to store on disk (near-zero cost to serialize/deserialize).
■All keys' hash codes in memory (5 bytes gives a <0.01% collision rate).
■256 static shards to reduce lock contention when loading keys.
●Each shard maps to a different file.
■Each value item contains (see the encoding sketch after this list):
●Offset in the file (3 bytes -> 16MB).
●Compressed offsets (offset >> 3 -> 128MB).
●Index into the array of loaded entries (2 bytes).
■LRU for evicting keys.
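
An illustrative sketch of the compact entry layout described above (field names and alignment choice are our assumptions, not Uber's actual schema): a 5-byte truncated hash, a 3-byte offset stored as offset >> 3 (extending the addressable range from 16MB to 128MB at 8-byte granularity), and a 2-byte index into the loaded-entries array.

package diskcache

// entry is the per-key in-memory record: 10 bytes of payload per key.
type entry struct {
	hash   [5]byte // truncated key hash; 5 bytes keeps collisions <0.01%
	offset [3]byte // file offset >> 3; 24 bits * 8-byte granularity = 128MB
	index  uint16  // slot in the array of entries currently loaded in memory
}

// packOffset compresses a byte offset; values must be 8-byte aligned.
func packOffset(off int64) [3]byte {
	v := off >> 3
	return [3]byte{byte(v), byte(v >> 8), byte(v >> 16)}
}

// unpackOffset restores the original byte offset.
func unpackOffset(b [3]byte) int64 {
	return (int64(b[0]) | int64(b[1])<<8 | int64(b[2])<<16) << 3
}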

Let's look at the previous example
■2M+ translations used to need ~800MB to load.
●Reloads meant 1.6GB.
■The new approach takes only 40MB to load all keys (a rough sanity check follows this list).
●We allow only ~50k keys (configurable) in memory, which is 40-80MB.
●Reloads now mean 80-120MB + 40MB for the new cache's keys.
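
As a rough sanity check (our arithmetic, not stated in the slides), 40MB over roughly 2M keys works out to about 20 bytes per key, plausible for a ~10-byte packed entry plus map and allocator overhead:

\[
\frac{40\ \mathrm{MB}}{2 \times 10^{6}\ \mathrm{keys}} \approx 20\ \mathrm{bytes/key}
\]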

Challenges
■Reloads impact hit ratio.
●Store the hash codes of all the current items in memory and use them to populate the new
in-memory cache.
■Per-shard vs. global LRU.
●Global is expensive due to lock contention.
●Per-shard is not accurate (the most frequent items can land in just a few shards).
●We went with global, but only update an entry if its last update was more than a second
ago.
■High CPU usage from fetching the Unix time (see the sketch after this list).
●Background thread that updates our local time (an int64 variable).
■Reloads don't need all the previous items.
●With time information, we now load only the last 15 minutes of translations, which improves
memory usage and cache eviction after reloading.
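
A minimal sketch of the background-time trick (an assumed implementation, not Uber's code): one goroutine refreshes an atomic int64 so hot paths read the clock without calling time.Now() on every request.

package coarsetime

import (
	"sync/atomic"
	"time"
)

var nowUnixNano atomic.Int64

func init() {
	nowUnixNano.Store(time.Now().UnixNano())
	go func() {
		// A 10ms tick gives plenty of precision for LRU bookkeeping.
		for t := range time.Tick(10 * time.Millisecond) {
			nowUnixNano.Store(t.UnixNano())
		}
	}()
}

// Now returns the cached time: a single atomic load on the hot path.
func Now() int64 {
	return nowUnixNano.Load()
}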

Impact

But what about latency?

Thank you! Let's connect.
Cristian Velazquez
[email protected]
linkedin.com/in/cdvr1993

Related blogs:
■GOGCTuner blog
■Ballast blog
■Shadower blog
■Presto GC tuning