Distributed System Performance Troubleshooting Like You’ve Been Doing it for Twenty Years
ScyllaDB
34 slides
Jun 27, 2024
About This Presentation
Troubleshooting performance issues across distributed systems can be intimidating if you don’t know where to start, and it’s even harder when the system is running on hundreds or thousands of nodes. We’re well past the point of logging into random nodes and poking around hoping we spot the problem. It’s critical to have a methodology to follow as well as a deep understanding of the tools that are available to help you prove (or disprove) your mental model.
In this session, we’ll explore how to diagnose the performance problems you might run into, and teach you the tools and process for getting to the bottom of any issue quickly, even when it’s one of the biggest distributed database deployments on the planet.
Slide Content
Distributed System Performance Troubleshooting Like You’ve Been Doing it for Twenty Years Jon Haddad Consultant @ Rustyrazorblade Consulting
Jon Haddad (he/him): Consultant @ Rustyrazorblade Consulting; Apache Cassandra Committer / PMC; Formerly Apple & Netflix Cassandra Teams; Tuned Hundreds of Cassandra Clusters. I ❤️ Solving Performance Problems!
So We’ve Got a Performance Problem…
Don’t Just Blame The Database!
We Need A Methodology
Rethink Assumptions
Ask The Right Questions. The report: "It’s slow!" The follow-ups: How slow? What’s it normally? Can I see a latency histogram? Did throughput change? Every machine or just a subset?
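To make "How slow? What’s it normally?" concrete, here is a minimal Python sketch that compares percentiles between a baseline window and the current window and buckets the samples into a coarse histogram. The function names, bucket bounds, and latency values are all hypothetical, chosen only for illustration; real numbers would come from your metrics system.

```python
import math
from collections import Counter


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latencies in milliseconds."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def histogram(samples, bucket_bounds):
    """Count samples per bucket; each bound is an inclusive upper limit in ms."""
    counts = Counter()
    for value in samples:
        label = next(
            (f"<={bound}ms" for bound in bucket_bounds if value <= bound),
            f">{bucket_bounds[-1]}ms",
        )
        counts[label] += 1
    return dict(counts)


def compare(baseline, current):
    """Return {percentile: (baseline_value, current_value)} for p50 and p99."""
    return {p: (percentile(baseline, p), percentile(current, p)) for p in (50, 99)}


# Made-up latencies (ms): the median is unchanged, only the tail moved.
baseline = [2, 3, 3, 4, 4, 5, 5, 6, 7, 9]
current = [2, 3, 3, 4, 4, 5, 5, 6, 80, 120]

if __name__ == "__main__":
    print(compare(baseline, current))
    print(histogram(current, [5, 10, 100]))
```

In this made-up data an average would blur the picture, while the percentiles and the histogram show that most requests are as fast as before and a small fraction became very slow.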
Where Is The Source? Narrow It Down!
Think About The Bigger Picture
Observability
Distributed Tracing
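As a rough illustration of what a trace can tell you, the sketch below attributes the time in one request to the service that actually spent it. The span layout (id, parent, service, start, end) is a hypothetical simplification, not the schema of any particular tracing system, and the calculation assumes a span's children run one after another rather than in parallel.

```python
from collections import defaultdict


def self_time_by_service(spans):
    """Sum each service's self time: span duration minus time spent in child spans.

    Assumes child spans of the same parent do not overlap in time.
    """
    child_time = defaultdict(float)
    for span in spans:
        if span["parent"] is not None:
            child_time[span["parent"]] += span["end"] - span["start"]

    totals = defaultdict(float)
    for span in spans:
        duration = span["end"] - span["start"]
        totals[span["service"]] += max(0.0, duration - child_time[span["id"]])
    return dict(totals)


# Hypothetical trace of one request, times in milliseconds.
trace = [
    {"id": "a", "parent": None, "service": "api", "start": 0, "end": 100},
    {"id": "b", "parent": "a", "service": "auth", "start": 5, "end": 15},
    {"id": "c", "parent": "a", "service": "db", "start": 20, "end": 95},
]

if __name__ == "__main__":
    for service, ms in sorted(self_time_by_service(trace).items(), key=lambda kv: -kv[1]):
        print(f"{service}: {ms:.0f} ms")
```

Here the database accounts for most of the request, but the same breakdown can just as easily point at the application or a downstream service.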
All Machines Or One Machine? One query or all queries?
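One way to answer "all machines or one machine?" is to compare each node's tail latency with the fleet median. The sketch below does that in Python; the node names, p99 values, and the 2x threshold are hypothetical and would need tuning for a real cluster.

```python
import statistics


def outlier_nodes(p99_by_node, factor=2.0):
    """Return nodes whose p99 latency exceeds `factor` times the fleet median."""
    median = statistics.median(p99_by_node.values())
    return sorted(node for node, p99 in p99_by_node.items() if p99 > factor * median)


# Hypothetical per-node p99 latencies in milliseconds.
fleet = {"node1": 12.0, "node2": 11.0, "node3": 13.0, "node4": 95.0, "node5": 12.5}

if __name__ == "__main__":
    print(outlier_nodes(fleet))
```

An empty result when every node is slow suggests a fleet-wide cause such as a workload change or a shared dependency, while a short list points at specific hosts worth inspecting.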
Gather Information
Understand Your Tools
                                      info
-------------------------------------------------------------------
  distribution: full
  vectorized: true

  • hash join
  │ estimated row count: 124,482
  │ equality: (rider_id) = (id)
  │
  ├── • scan
  │     estimated row count: 125,000 (100% of the table; stats collected 13 minutes ago)
  │     table: rides@rides_pkey
  │     spans: FULL SCAN
  │
  └── • scan
        estimated row count: 12,500 (100% of the table; stats collected 14 minutes ago)
        table: users@users_pkey
        spans: FULL SCAN

  index recommendations: 2
  1. type: index creation
     SQL command: CREATE INDEX ON rides (rider_id) STORING (vehicle_city, vehicle_id, start_address, end_address, start_time, end_time, revenue);
  2. type: index creation
     SQL command: CREATE INDEX ON users (id) STORING (name, address, credit_card);
(22 rows)

Time: 2ms total (execution 2ms / network 0ms)
Profiling and Flame Graphs Linux: perf, Java: async-profiler
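A flame graph is built from aggregated stack samples. The sketch below shows that aggregation step in Python: it collapses sampled call stacks into the semicolon-joined "folded" text format that flame graph scripts commonly take as input. The sample stacks and function names are made up; in practice the samples would come from a profiler such as perf or async-profiler.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples (root frame first) into 'frame;frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Hypothetical stack samples, one list per sample, root frame first.
samples = [
    ["main", "handle_request", "read_row"],
    ["main", "handle_request", "read_row"],
    ["main", "handle_request", "serialize"],
    ["main", "gc"],
]

if __name__ == "__main__":
    for line in fold_stacks(samples):
        print(line)
```

The width of each box in the resulting flame graph is proportional to these counts, so the widest frames are where the sampled time went.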
Today’s Summary: Observability is Critical; Narrow the problem down; Distributed Tracing; Latency, Throughput; Utilization, Saturation, Errors (USE); Profiling is easy and effective!