Building Scalable End-to-End Latency Metrics from Distributed Trace by Kusha Maharshi
ScyllaDB
Oct 13, 2025
About This Presentation
In our sprawling microservices architecture at Bloomberg, timing requests from point A to point Z means navigating an alphabet's worth of services in between. With more than 50 billion spans a day, we process distributed trace into directed acyclic graphs (DAGs) with deep fan-outs and fan-ins that present concurrency chokepoints at scale. In this talk, we'll show how we wrestled complicated DAGs into shape, thus generating end-to-end latency metrics that drive SLOs, fire off alerts, and keep us all honest when latency starts creeping in!
Slide Content
A ScyllaDB Community
Building Scalable End-to-End
Latency Metrics From
Distributed Trace
Kusha Maharshi
Senior Software Engineer
Kusha Maharshi (she/her)
Senior Software Engineer at Bloomberg
■Telemetry infrastructure, distributed tracing
■My thought on p99s:
●Fast lane for the p99s, scenic route for the last 1%
■Distributed, streaming graph processing at scale
■Outside of work - dance, WNBA, reading
Storytime
■Recurring request: centralized end-to-end latency metrics
●In finance: milliseconds of delay = millions of dollars lost!
●State of things before: custom solutions by different engineering teams
■Distributed trace saves the day!
●Complex data:
■Long traces, fan-outs/fan-ins, >50B (& counting) daily spans
●So, we built a solution to handle that!
Distributed Trace
In The Wild
From equity order execution to viewing status of orders
Ensuring User Satisfaction
Why End-to-End Latency Monitoring Matters
■Monitor end-to-end client-facing workflows
●Client = external user or internal engineering team
■Latency spikes
●Something may be wonky!
●Alert engineers to debug
■Design & development
●End-to-end, A/B testing
■SLOs
●Also for managers, executives
Zooming In
Example Trace For Workflow
[Trace timeline diagram: traceId 0123456789abcdef with consume, process, and publish spans, plus traceIds 0123456789abcde1 and 0123456789abcde2 each with a consume span, plotted over a 0 to ~6μs window]
1. Consume queued messages
2. Batch and process
3. Publish to clients
Distributed Trace as a Directed Acyclic Graph (DAG)
[DAG diagram: each node is a span labeled with its traceId (e.g., t1), spanId (e.g., s11), and operation (e.g., consume); edges follow parent-child links, converging at a fan-in]
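As a rough illustration (names and field layout are hypothetical, not Bloomberg's actual schema), spans can be modeled as DAG nodes keyed by (traceId, spanId) and linked through parent references; a span with parents in more than one trace is where fan-in shows up:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    trace_id: str             # e.g., "t1"
    span_id: str              # e.g., "s11"
    operation: str            # e.g., "consume"
    start_us: float           # start timestamp, microseconds
    end_us: float             # end timestamp, microseconds
    parent_ids: tuple = ()    # (trace_id, span_id) of parents; several parents = fan-in

def build_dag(spans):
    """Index spans by (trace_id, span_id) and derive child edges from parent refs."""
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    children = defaultdict(list)
    for s in spans:
        for parent in s.parent_ids:
            children[parent].append((s.trace_id, s.span_id))
    return by_id, children
```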
End-to-End "Rule"
Rule: start = consume, end = publish
"Get me the latency of requests going from consume to publish"
Rule = consumeToPublish
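A rule in this sense is little more than a named (start operation, end operation) pair; a minimal sketch, with hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    name: str
    start_operation: str
    end_operation: str

# "Get me the latency of requests going from consume to publish"
consume_to_publish = Rule(name="consumeToPublish",
                          start_operation="consume",
                          end_operation="publish")
```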
Crux: Depth-First Search (DFS) through DAG
Difference between start of start event & end of end event
Path 1: 3μs
Path 2: 3.5μs
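A sketch of that crux, reusing the hypothetical Span/build_dag helpers above: run a DFS from every span matching the rule's start operation and record one latency per path that reaches a span matching the end operation, i.e., end of the end event minus start of the start event:

```python
def end_to_end_latencies(spans, rule):
    """One latency sample per start->end path found by DFS through the trace DAG."""
    by_id, children = build_dag(spans)   # helpers from the DAG sketch above
    latencies = []

    def dfs(node_id, path_start_us):
        span = by_id[node_id]
        if span.operation == rule.end_operation:
            # Difference between start of the start event and end of the end event.
            latencies.append(span.end_us - path_start_us)
        for child_id in children[node_id]:
            dfs(child_id, path_start_us)

    for node_id, span in by_id.items():
        if span.operation == rule.start_operation:
            dfs(node_id, span.start_us)
    return latencies
```

On the example trace above, the two paths would yield two samples (3μs and 3.5μs) that feed the latency distribution for the rule.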
Distributed Trace At Scale
Simple Traces At Scale
Enter "Fan-Ins"
Traces With Fan-ins (At Scale?)
Problem
Scaling with fan-ins
Solution #1
Put all spans from a group of fanned-in traces (aka trace bundle) in the same partition
Solution #1
Aggregate Trace Bundles By Rule
Bundler: Merge Trace Bundles, Partition By Rule
Bundler: In-memory maps of potential trace-rule matches for trace bundles
Result: Aggregate Trace Bundles By Rule
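One way to picture the greedy bundling (component and field names hypothetical, and far simpler than a real bundler): a union-find over fan-in links so every trace in a bundle resolves to the same representative, which then forms the partition key together with the rule:

```python
class TraceBundler:
    """Greedy bundling sketch: traces linked by fan-in share one partition key."""

    def __init__(self):
        self._parent = {}  # union-find forest over trace ids

    def _find(self, trace_id):
        self._parent.setdefault(trace_id, trace_id)
        while self._parent[trace_id] != trace_id:
            # Path halving keeps lookups cheap as bundles grow.
            self._parent[trace_id] = self._parent[self._parent[trace_id]]
            trace_id = self._parent[trace_id]
        return trace_id

    def link(self, trace_id, fanned_in_trace_id):
        """Record a fan-in edge so both traces resolve to the same bundle."""
        self._parent[self._find(fanned_in_trace_id)] = self._find(trace_id)

    def partition_key(self, trace_id, rule_name, num_partitions):
        """All spans of a bundle hash to one partition per rule.

        Note: Python's built-in hash() is salted per process; a stable hash
        would be needed across real producer instances.
        """
        return hash((self._find(trace_id), rule_name)) % num_partitions
```

Because every span of a bundle hashes to the same partition, one consumer instance sees the whole DAG for a rule, which is also exactly what makes high-volume rules a load-imbalance risk.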
Solution #1
■Ensures all spans from a trace bundle are in the same partition
■Works well when rules match a small set of spans & traces
■Scalability constraints
●Load imbalance
■Rules with high-volume trace matches bloat dagger memory
■e.g., rules on middleware spans
●Fan-in memory bloat
■Fan-in traces that don't match any rules can exacerbate load imbalance
Solution #2
Lazy Approach: Send partial DFS info to instance with fan-in trace
Partial DFS Gossip
Solution #2
Instead of building trace bundles beforehand (greedy), this solution sends out info about partial DFS paths when fan-in parents are encountered (lazy)
■Addresses load imbalance of solution #1
■Constraints
●Deep, chained fan-ins lead to long communication delays
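A rough sketch of the lazy alternative (message shape, routing, and cross-trace link bookkeeping are all assumptions, reusing the earlier Span/build_dag helpers): instead of co-locating whole bundles, the DFS ships its partial state to the instance that owns the fan-in trace, and that instance resumes the walk:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartialDfsPath:
    """Partial DFS state gossiped to the instance owning the fan-in trace."""
    rule_name: str
    start_us: float      # start timestamp of the matched start event
    resume_node: tuple   # (trace_id, span_id) where the DFS should continue

def dfs_with_gossip(spans, cross_trace_children, rule, start_us, start_node, send):
    """Walk the locally held part of the DAG; gossip partial paths at fan-in boundaries.

    cross_trace_children maps a local (trace_id, span_id) to child spans that
    belong to other traces and therefore live on other instances.
    """
    by_id, children = build_dag(spans)   # helpers from the DAG sketch above
    latencies = []

    def dfs(node_id):
        span = by_id[node_id]
        if span.operation == rule.end_operation:
            latencies.append(span.end_us - start_us)
        for child_id in children[node_id]:
            dfs(child_id)
        for remote_id in cross_trace_children.get(node_id, ()):
            # Fan-in boundary: ship the partial path instead of pulling the
            # remote trace's spans here; the receiver resumes the DFS.
            send(PartialDfsPath(rule.name, start_us, remote_id))

    dfs(start_node)
    return latencies
```

The receiver simply runs the same walk again with msg.start_us and msg.resume_node, which is also why deep chains of fan-ins turn into chains of hops and the communication delays noted above.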
Conclusion
End-to-end latency metrics from complex traces can be generated at scale
■To pick the best approach for you
●Employ operational knowledge about your trace data
●Mix and match
■End-to-end latency monitoring is crucial
●Trace provides a unified solution
●Client impact can be directly monitored
■SLOs
●Design and development of complex pipelines
■Distributed, streaming DAGs are hard and fun!