Building Scalable End-to-End Latency Metrics from Distributed Trace by Kusha Maharshi

ScyllaDB · 29 slides · Oct 13, 2025

About This Presentation

In our sprawling microservices architecture at Bloomberg, timing requests from point A to point Z means navigating an alphabet's worth of services in between. With more than 50 billion spans a day, we process distributed trace into directed acyclic graphs (DAGs) with deep fan-outs and fan-ins th...


Slide Content

A ScyllaDB Community
Building Scalable End-to-End
Latency Metrics From
Distributed Trace
Kusha Maharshi
Senior Software Engineer

Kusha Maharshi (she/her)

Senior Software Engineer at Bloomberg
■Telemetry infrastructure, distributed tracing
■My thought on p99s:
●Fast lane for the p99s, scenic route for the last 1%
■Distributed, streaming graph processing at scale
■Outside of work - dance, WNBA, reading

Storytime

■Recurring request: centralized end-to-end latency metrics
●In finance: milliseconds of delay = millions of dollars lost!
●State of things before: custom solutions by different engineering teams
■Distributed trace saves the day!
●Complex data:
■Long traces, fan-outs/fan-ins, >50B (& counting) daily spans
●So, we built a solution to handle that!

Distributed Trace
In The Wild

From equity order execution to viewing the status of orders

Ensuring User Satisfaction

Why End-to-End Latency Monitoring Matters

■Monitor end-to-end client-facing workflows
●Client = external user or internal engineering team
■Latency spikes
●Something may be wonky!
●Alert engineers to debug
■Design & development
●End-to-end, A/B testing
■SLOs
●Also for managers, executives

Zooming In

Example Trace For Workflow






[Diagram: three example traces on microsecond timelines (0s to roughly 6μs) with consume, process, and publish spans; traceIds 0123456789abcdef, 0123456789abcde1, and 0123456789abcde2]

1. Consume queued messages
2. Batch and process
3. Publish to clients

Distributed Trace as a Directed Acyclic Graph (DAG)



[Diagram: the trace rendered as a DAG; each node is labeled with its traceId (e.g., t1), spanId (e.g., s11), and operation (e.g., consume); the fan-in node is highlighted]
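
As a rough illustration of the data model these slides describe, here is a minimal Python sketch of a span node and a child adjacency list for the trace DAG. The field names (trace_id, span_id, operation, parent_ids) and the build_dag helper are assumptions for illustration, not Bloomberg's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One node in the trace DAG (illustrative field names, not the actual schema)."""
    trace_id: str
    span_id: str
    operation: str                      # e.g. "consume", "process", "publish"
    start_us: float                     # start timestamp, in microseconds
    end_us: float                       # end timestamp, in microseconds
    parent_ids: list = field(default_factory=list)  # more than one parent => fan-in

def build_dag(spans):
    """Map each span id to its child spans, giving a child adjacency list for the DAG."""
    children = {}
    for span in spans:
        for parent_id in span.parent_ids:
            children.setdefault(parent_id, []).append(span)
    return children
```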

End-to-End "Rule"

Rule: start = consume, end = publish
Get me the latency of requests going from consume to publish.
Rule = consumeToPublish

Crux: Depth-First Search (DFS) through DAG
Latency = difference between the start of the start event and the end of the end event

Path 1: 3μs
Path 2: 3.5μs

Distributed Trace At Scale

Simple Traces At Scale

Enter "Fan-Ins"

Traces With Fan-ins (At Scale?)
Problem
Scaling with fan-ins

Solution #1
Put all spans from a group of fanned-in traces (aka a trace bundle) in the same partition

Solution #1

Aggregate Trace Bundles By Rule

Bundler: Merge Trace Bundles, Partition By Rule

Bundler: In-memory maps of potential trace-rule matches for trace bundles

Result: Aggregate Trace Bundles By Rule
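
One way to realize Solution #1's "same partition per trace bundle" idea is to union traces that are linked by fan-in edges and derive the partition key from the bundle's representative trace, so every span of the bundle lands on the same processing instance. The sketch below is an assumption about how such a bundler could be structured, not the production bundler from the talk.

```python
import zlib

class TraceBundler:
    """Union-find over trace ids: traces linked by a fan-in edge are merged into one
    bundle, and the bundle's representative trace determines the partition key."""

    def __init__(self):
        self.parent = {}  # trace_id -> parent trace_id

    def _find(self, trace_id):
        """Return the bundle representative for a trace (with path halving)."""
        self.parent.setdefault(trace_id, trace_id)
        while self.parent[trace_id] != trace_id:
            self.parent[trace_id] = self.parent[self.parent[trace_id]]
            trace_id = self.parent[trace_id]
        return trace_id

    def link(self, trace_a, trace_b):
        """Record that trace_a and trace_b belong to the same fan-in group, merging their bundles."""
        self.parent[self._find(trace_a)] = self._find(trace_b)

    def partition_key(self, trace_id, num_partitions):
        """All traces in the same bundle hash to the same partition."""
        bundle_id = self._find(trace_id)
        return zlib.crc32(bundle_id.encode("utf-8")) % num_partitions
```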

Solution #1
■Ensures all spans from a trace bundle are in the same partition
■Works well when rules match a small set of spans & traces
■Scalability constraints
●Load imbalance
■Rules with a high volume of matching traces bloat dagger memory
■e.g., rules on middleware spans
●Fan-in memory bloat
■Fan-in traces that don't match any rules can exacerbate load imbalance

Solution #2

Lazy Approach: Send partial DFS info to instance with fan-in trace

Partial DFS Gossip

Solution #2
Instead of building trace bundles beforehand (greedy), this solution sends out info about partial DFS paths when fan-in parents are encountered (lazy); see the sketch after this slide's bullets.
■Addresses load imbalance of solution #1
■Constraints
●Deep, chained fan-ins lead to long communication delays
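
A hedged sketch of the lazy approach: when the DFS on one instance reaches a fan-in parent whose trace is owned by another instance, it could ship a small message carrying the rule, the original start timestamp, and the frontier span, and the receiving instance resumes the DFS locally. The message shape and function names below are illustrative only and reuse the Span/DAG structures from the earlier sketches.

```python
from dataclasses import dataclass

@dataclass
class PartialDfsMessage:
    """Gossip payload sent to the instance that owns the fan-in trace
    (field names are illustrative, not an actual wire format)."""
    rule_name: str          # e.g. "consumeToPublish"
    root_start_us: float    # start of the start event, recorded on the originating instance
    frontier_span_id: str   # span at which the local DFS stopped
    target_trace_id: str    # fan-in trace owned by the receiving instance

def resume_dfs(msg, spans_by_id, children, end_op="publish"):
    """On the receiving instance, continue the DFS from the frontier span and emit
    latencies measured against the original root start time."""
    latencies = []
    stack = [spans_by_id[msg.frontier_span_id]]
    while stack:
        span = stack.pop()
        if span.operation == end_op:
            latencies.append(span.end_us - msg.root_start_us)
        stack.extend(children.get(span.span_id, []))
    return latencies
```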

Conclusion
End-to-end latency metrics from complex traces can be generated at scale
■To pick the best approach for you
●Employ operational knowledge about your trace data
●Mix and match
■End-to-end latency monitoring is crucial
●Trace provides a unified solution
●Client impact can be directly monitored
■SLOs
●Design and development of complex pipelines
■Distributed, streaming DAGs are hard and fun!

Thank you! Let’s connect.
Kusha Maharshi
[email protected]