Timeseries Storage at Ludicrous Speed by Duarte Nunes
ScyllaDB
21 slides
Oct 09, 2025
About This Presentation
Datadog’s real-time storage system for timeseries data ingests billions of points per second and serves thousands of queries per second with sub-second latencies. This session will describe how the system is designed, how it has evolved over time, and the techniques we use today to meet our performance goals.
Slide Content
A ScyllaDB Community
Timeseries storage at ludicrous speed
Duarte Nunes
Staff Engineer
Duarte Nunes (he/him)
Staff Engineer at Datadog
■Background in distributed systems
■Rewrote way too many bespoke databases in Rust
■Into landscape/wildlife photography and letterpressed books
Billions of datapoints / second
Millions of hosts
High-level overview of the Metrics Platform
An RTDB cluster
Inside an RTDB node
Monocle
Data Model: (org, metric, hash(tags)) -> [(timestamp, value)]
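The mapping on this slide can be sketched as an in-memory map from a series key to its points. This is a minimal illustration, not Monocle's actual layout: the names `SeriesKey`, `Point`, `hash_tags`, and `SeriesStore` are hypothetical, as are the field types and the choice to sort tags before hashing so that tag order does not change the key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical series key mirroring (org, metric, hash(tags)).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct SeriesKey {
    pub org: u64,
    pub metric: String,
    pub tags_hash: u64,
}

/// One (timestamp, value) sample; the field types are assumptions.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Point {
    pub timestamp: i64,
    pub value: f64,
}

/// Hash a tag set so that the order the tags arrive in does not matter.
/// Sorting and deduplicating first is an assumption made for this sketch.
pub fn hash_tags(tags: &[&str]) -> u64 {
    let mut sorted: Vec<&str> = tags.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    let mut hasher = DefaultHasher::new();
    sorted.hash(&mut hasher);
    hasher.finish()
}

/// Toy store: series key -> points in arrival order.
#[derive(Default)]
pub struct SeriesStore {
    series: HashMap<SeriesKey, Vec<Point>>,
}

impl SeriesStore {
    fn key(org: u64, metric: &str, tags: &[&str]) -> SeriesKey {
        SeriesKey {
            org,
            metric: metric.to_string(),
            tags_hash: hash_tags(tags),
        }
    }

    /// Append a point to the series identified by (org, metric, tags).
    pub fn insert(&mut self, org: u64, metric: &str, tags: &[&str], point: Point) {
        self.series
            .entry(Self::key(org, metric, tags))
            .or_default()
            .push(point);
    }

    /// Return the points of one series, or an empty slice if it is unknown.
    pub fn points(&self, org: u64, metric: &str, tags: &[&str]) -> &[Point] {
        self.series
            .get(&Self::key(org, metric, tags))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

Because only the hash of the tags appears in the key, this sketch cannot answer "which series carry tag X"; resolving tags to series would have to happen elsewhere.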
Thread-per-coreish
A Monocle worker instance
Memtable
A Monocle file set
Files & Compaction
Cache
Throttling queries under load
Gated queries and cost-based scheduling
■Gates allow admission based on:
●Ingestion lag
●Available memory
●Overall concurrency
■Cost-based scheduling
●Uses CoDel to manage latency
●Queries subject to timeout
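The gating and CoDel bullets above can be sketched roughly as follows. Everything here is an assumption made for illustration: the types `NodeState`, `GateLimits`, `Rejection`, and `CoDelLite`, the thresholds, and the order of the checks are hypothetical. `CoDelLite` is a deliberately simplified CoDel-style detector: it only reports when queueing delay has stayed above a target for a full interval, and omits the control law that full CoDel uses to pace drops.

```rust
use std::time::Duration;

/// Snapshot of node health consulted by the gates (hypothetical fields).
#[derive(Clone, Copy, Debug)]
pub struct NodeState {
    pub ingestion_lag: Duration,
    pub available_memory_bytes: u64,
    pub running_queries: usize,
}

/// Thresholds for the three gates (hypothetical).
#[derive(Clone, Copy, Debug)]
pub struct GateLimits {
    pub max_ingestion_lag: Duration,
    pub min_available_memory_bytes: u64,
    pub max_concurrency: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Rejection {
    IngestionLag,
    LowMemory,
    TooManyQueries,
}

/// Admit a query only if every gate is open.
pub fn admit(state: &NodeState, limits: &GateLimits) -> Result<(), Rejection> {
    if state.ingestion_lag > limits.max_ingestion_lag {
        return Err(Rejection::IngestionLag);
    }
    if state.available_memory_bytes < limits.min_available_memory_bytes {
        return Err(Rejection::LowMemory);
    }
    if state.running_queries >= limits.max_concurrency {
        return Err(Rejection::TooManyQueries);
    }
    Ok(())
}

/// Simplified CoDel-style detector: signals shedding once the queueing
/// delay has stayed at or above `target` for at least `interval`.
pub struct CoDelLite {
    target: Duration,
    interval: Duration,
    /// Clock reading at which the delay first reached the target, if it
    /// has not dropped below the target since.
    above_since: Option<Duration>,
}

impl CoDelLite {
    pub fn new(target: Duration, interval: Duration) -> Self {
        CoDelLite { target, interval, above_since: None }
    }

    /// `now` is elapsed time on a monotonic clock; `sojourn` is how long
    /// the query being dequeued waited. Returns true if it should be shed.
    pub fn should_shed(&mut self, now: Duration, sojourn: Duration) -> bool {
        if sojourn < self.target {
            // Delay recovered: reset the detector.
            self.above_since = None;
            return false;
        }
        match self.above_since {
            None => {
                self.above_since = Some(now);
                false
            }
            Some(start) => now.saturating_sub(start) >= self.interval,
        }
    }
}
```

Passing `now` in explicitly, rather than reading a clock inside `should_shed`, keeps the detector deterministic and easy to exercise in tests.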
Partitioning
By partitioning on a hash, queries are executed by all workers
■Parallelism is great?
●Amdahl's law
●Head-of-line blocking
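The two caveats above can be made concrete with a small sketch. `amdahl_speedup` is the standard Amdahl's law formula, 1 / ((1 - p) + p / n). `fanout_latency` reflects one reading of the head-of-line bullet, namely that a query executed by all workers finishes only when its slowest worker does. Both function names are hypothetical and the numbers are illustrative, not measurements from the talk.

```rust
use std::time::Duration;

/// Amdahl's law: upper bound on speedup when a fraction `parallel_fraction`
/// of the work is spread over `workers` and the rest stays serial.
pub fn amdahl_speedup(parallel_fraction: f64, workers: u32) -> f64 {
    assert!((0.0..=1.0).contains(&parallel_fraction));
    assert!(workers >= 1);
    1.0 / ((1.0 - parallel_fraction) + parallel_fraction / f64::from(workers))
}

/// Latency of a query fanned out to every worker: the slowest worker
/// decides. Returns None when there are no workers.
pub fn fanout_latency(worker_latencies: &[Duration]) -> Option<Duration> {
    worker_latencies.iter().copied().max()
}
```

For example, with 90% of a query parallelizable, ten workers give at most about a 5.3x speedup, and a single worker stuck behind an expensive query sets the latency of the whole fan-out.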
Rust
We’ve had lots of success building on Rust and Tokio
■Intermediate scheduler woes (e.g., FuturesUnordered)
●Buffering
●Yielding
■Future cancellation is hard to reason about
Looking ahead
Smarter Routing
Move to a more dynamic load-balancing system
■Current routing is inspired by Google’s Slicer
●Balances PPS
●Slow to adjust
■Data movement to adapt to query pattern changes and bursts
Collocating Points and Tags
Single realtime database, stop indexing everything
■Columnar data
●Read only what’s being queried
●Series-per-row
●Great for SIMD
●Compression techniques that allow random access to the data (e.g., FSST)
■No more thread-per-core?