Building Scalable End-to-End Latency Metrics from Distributed Trace by Kusha Maharshi
ScyllaDB
Oct 13, 2025
About This Presentation
In our sprawling microservices architecture at Bloomberg, timing requests from point A to point Z means navigating an alphabet's worth of services in between. With more than 50 billion spans a day, we process distributed trace into directed acyclic graphs (DAGs) with deep fan-outs and fan-ins that present concurrency chokepoints at scale. In this talk, we'll show how we wrestled complicated DAGs into shape, thus generating end-to-end latency metrics that drive SLOs, fire off alerts, and keep us all honest when latency starts creeping in!
Slide Content
A ScyllaDB Community
Building Scalable End-to-End
Latency Metrics From
Distributed Trace
Kusha Maharshi
Senior Software Engineer
Kusha Maharshi (she/her)
Senior Software Engineer at Bloomberg
■Telemetry infrastructure, distributed tracing
■My thought on p99s:
●Fast lane for the p99s, scenic route for the last 1%
■Distributed, streaming graph processing at scale
■Outside of work - dance, WNBA, reading
Storytime
■Recurring request: centralized end-to-end latency metrics
●In finance: milliseconds of delay = millions of dollars lost!
●State of things before: custom solutions by different engineering teams
■Distributed trace saves the day!
●Complex data:
■Long traces, fan-outs/fan-ins, >50B (& counting) daily spans
●So, we built a solution to handle that!
Distributed Trace
In The Wild
From equity order execution to viewing status of orders
Ensuring User Satisfaction
Why End-to-End Latency Monitoring Matters
■Monitor end-to-end client-facing workflows
●Client = external user or internal engineering team
■Latency spikes
●Something may be wonky!
●Alert engineers to debug
■Design & development
●End-to-end, A/B testing
■SLOs
●Also for managers, executives
Zooming In
Example Trace For Workflow
[Trace timeline diagram: traceId 0123456789abcdef with consume, process, and publish spans, plus traceIds 0123456789abcde1 and 0123456789abcde2 each with a consume span, plotted over a 0 to ~6μs window]
1. Consume queued messages
2. Batch and process
3. Publish to clients
Distributed Trace as a Directed Acyclic Graph (DAG)
[DAG diagram: each node is a span labeled with its traceId (e.g., t1), spanId (e.g., s11), and operation (e.g., consume); edges follow parent-child links, converging at a fan-in]
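As a rough illustration (names and field layout are hypothetical, not Bloomberg's actual schema), spans can be modeled as DAG nodes keyed by (traceId, spanId) and linked through parent references; a span with parents in more than one trace is where fan-in shows up:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    trace_id: str             # e.g., "t1"
    span_id: str              # e.g., "s11"
    operation: str            # e.g., "consume"
    start_us: float           # start timestamp, microseconds
    end_us: float             # end timestamp, microseconds
    parent_ids: tuple = ()    # (trace_id, span_id) of parents; several parents = fan-in

def build_dag(spans):
    """Index spans by (trace_id, span_id) and derive child edges from parent refs."""
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    children = defaultdict(list)
    for s in spans:
        for parent in s.parent_ids:
            children[parent].append((s.trace_id, s.span_id))
    return by_id, children
```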
End-to-End "Rule"
Rule: start = consume, end = publish
"Get me the latency of requests going from consume to publish"
Rule = consumeToPublish
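A rule in this sense is little more than a named (start operation, end operation) pair; a minimal sketch, with hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    name: str
    start_operation: str
    end_operation: str

# "Get me the latency of requests going from consume to publish"
consume_to_publish = Rule(name="consumeToPublish",
                          start_operation="consume",
                          end_operation="publish")
```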
Crux: Depth-First Search (DFS) through DAG
Difference between start of start event & end of end event
Path 1: 3μs
Path 2: 3.5μs
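A sketch of that crux, reusing the hypothetical Span/build_dag helpers above: run a DFS from every span matching the rule's start operation and record one latency per path that reaches a span matching the end operation, i.e., end of the end event minus start of the start event:

```python
def end_to_end_latencies(spans, rule):
    """One latency sample per start->end path found by DFS through the trace DAG."""
    by_id, children = build_dag(spans)   # helpers from the DAG sketch above
    latencies = []

    def dfs(node_id, path_start_us):
        span = by_id[node_id]
        if span.operation == rule.end_operation:
            # Difference between start of the start event and end of the end event.
            latencies.append(span.end_us - path_start_us)
        for child_id in children[node_id]:
            dfs(child_id, path_start_us)

    for node_id, span in by_id.items():
        if span.operation == rule.start_operation:
            dfs(node_id, span.start_us)
    return latencies
```

On the example trace above, the two paths would yield two samples (3μs and 3.5μs) that feed the latency distribution for the rule.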
Distributed Trace At Scale
Simple Traces At Scale
Enter "Fan-Ins"
Traces With Fan-ins (At Scale?)
Problem
Scaling with fan-ins
Solution #1
Put all spans from a group of fanned-in traces (aka trace bundle) in the same partition
Solution #1
Aggregate Trace Bundles By Rule
Bundler: Merge Trace Bundles, Partition By Rule
Bundler: In-memory maps of potential trace-rule matches for trace bundles
Result: Aggregate Trace Bundles By Rule
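One way to picture the greedy bundling (component and field names hypothetical, and far simpler than a real bundler): a union-find over fan-in links so every trace in a bundle resolves to the same representative, which then forms the partition key together with the rule:

```python
class TraceBundler:
    """Greedy bundling sketch: traces linked by fan-in share one partition key."""

    def __init__(self):
        self._parent = {}  # union-find forest over trace ids

    def _find(self, trace_id):
        self._parent.setdefault(trace_id, trace_id)
        while self._parent[trace_id] != trace_id:
            # Path halving keeps lookups cheap as bundles grow.
            self._parent[trace_id] = self._parent[self._parent[trace_id]]
            trace_id = self._parent[trace_id]
        return trace_id

    def link(self, trace_id, fanned_in_trace_id):
        """Record a fan-in edge so both traces resolve to the same bundle."""
        self._parent[self._find(fanned_in_trace_id)] = self._find(trace_id)

    def partition_key(self, trace_id, rule_name, num_partitions):
        """All spans of a bundle hash to one partition per rule.

        Note: Python's built-in hash() is salted per process; a stable hash
        would be needed across real producer instances.
        """
        return hash((self._find(trace_id), rule_name)) % num_partitions
```

Because every span of a bundle hashes to the same partition, one consumer instance sees the whole DAG for a rule, which is also exactly what makes high-volume rules a load-imbalance risk.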
Solution #1
■Ensures all spans from a trace bundle are in the same partition
■Works well when rules match a small set of spans & traces
■Scalability constraints
●Load imbalance
■Rules with high-volume trace matches bloat dagger memory
■e.g., rules on middleware spans
●Fan-in memory bloat
■Fan-in traces that don't match any rules can exacerbate load imbalance
Solution #2
Lazy Approach: Send partial DFS info to instance with fan-in trace
Partial DFS Gossip
Solution #2
Instead of building trace bundles beforehand (greedy), this solution sends out info about partial DFS paths when fan-in parents are encountered (lazy)
■Addresses load imbalance of solution #1
■Constraints
●Deep, chained fan-ins lead to long communication delays
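A rough sketch of the lazy alternative (message shape, routing, and cross-trace link bookkeeping are all assumptions, reusing the earlier Span/build_dag helpers): instead of co-locating whole bundles, the DFS ships its partial state to the instance that owns the fan-in trace, and that instance resumes the walk:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartialDfsPath:
    """Partial DFS state gossiped to the instance owning the fan-in trace."""
    rule_name: str
    start_us: float      # start timestamp of the matched start event
    resume_node: tuple   # (trace_id, span_id) where the DFS should continue

def dfs_with_gossip(spans, cross_trace_children, rule, start_us, start_node, send):
    """Walk the locally held part of the DAG; gossip partial paths at fan-in boundaries.

    cross_trace_children maps a local (trace_id, span_id) to child spans that
    belong to other traces and therefore live on other instances.
    """
    by_id, children = build_dag(spans)   # helpers from the DAG sketch above
    latencies = []

    def dfs(node_id):
        span = by_id[node_id]
        if span.operation == rule.end_operation:
            latencies.append(span.end_us - start_us)
        for child_id in children[node_id]:
            dfs(child_id)
        for remote_id in cross_trace_children.get(node_id, ()):
            # Fan-in boundary: ship the partial path instead of pulling the
            # remote trace's spans here; the receiver resumes the DFS.
            send(PartialDfsPath(rule.name, start_us, remote_id))

    dfs(start_node)
    return latencies
```

The receiver simply runs the same walk again with msg.start_us and msg.resume_node, which is also why deep chains of fan-ins turn into chains of hops and the communication delays noted above.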
Conclusion
End-to-end latency metrics from complex traces can be generated at scale
■To pick the best approach for you
●Employ operational knowledge about your trace data
●Mix and match
■End-to-end latency monitoring is crucial
●Trace provides a unified solution
●Client impact can be directly monitored
■SLOs
●Design and development of complex pipelines
■Distributed, streaming DAGs are hard and fun!