Measuring Query Latency the Hard Way: An Adventure in Impractical Postgres Monitoring by Simon Notley

ScyllaDB 0 views 20 slides Oct 15, 2025
Slide 1
Slide 1 of 20
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20

About This Presentation

Sampling the session state (as exposed by pg_stat_activity) is a surprisingly powerful way to understand how your Postgres instance spends its time. It is something I can wholeheartedly recommend to any Postgres DBA that needs a lightweight way to monitor query performance in production. However, it...


Slide Content

A ScyllaDB Community
Measuring Query Latency the
Hard Way: An Adventure in
Impractical Postgres Monitoring
Simon Notley
Observability and Optimization

Simon Notley (he/him)

Observability and Optimization PM at EDB
■Something cool: Gained the freedom of Tryfan
■Perspective on P99s: The time 1 in every 100 of
your users is waiting and wondering if it’s broken
■Another thing: I used to race bicycles
■Away from work: dad stuff

Pursuing terrible ideas can be fun

Take a good idea
Understand it’s strengths and
weaknesses
Undeterred, try to apply it to something
you know it’s no good at
Have fun
Session sampling
Good at proportions, bad at details
Query latency
“fun”

Time-domain sampling, huh! What is it good for?


23%
20%

Time-domain sampling, huh! What is it good for?


Details
Proportions
…and I will now use
it for details…

Everything’s not lost



datid | 16385
datname | simon
pid | 45479
leader_pid |
usesysid | 16384
usename | simon
application_name | psql
client_addr |
client_hostname |
client_port | -1
backend_start | 2025-07-04 19:54:58.321751+01
xact_start | 2025-07-04 19:55:22.23828+01
query_start | 2025-07-04 19:55:22.23828+01
state_change | 2025-07-04 19:55:22.23828+01
wait_event_type |
wait_event |
state | active
backend_xid |
backend_xmin | 778
query_id |
query | select * from pg_stat_activity;
backend_type | client backend

Even an end has a start

time
We know the query started here
It was still running here
It wasn’t running here
a b

Even an end has a start
time
a bc
estimated_duration = 2a + c

Unbiased!

For a 1000 ms query, for a range of sample periods, calculate our estimated duration for all possible
relative positions of the query and the samples
Box: 25th - 75th percentile

Whiskers: 5th - 95th percentile

Central mark: median (and mean)

Biased?

Query IDTrue Mean Latency / ms
1 56 ± 3
2 300 ± 10
3 550 ± 20
4 800 ± 20
5 1050 ± 20

Biased?

Query IDTrue Mean Latency / msEstimated Mean Latency / ms
1 56 ± 3 520 ± 70
2 300 ± 10 520 ± 40
3 550 ± 20 680 ± 40
4 800 ± 20 870 ± 40
5 1050 ± 20 1080 ± 40

Biased?

Query IDTrue Mean Latency / msEstimated Mean Latency / msTrue Mean Latency (sampled
queries only) / ms
1 56 ± 3 520 ± 70 520 ± 60
2 300 ± 10 520 ± 40 520 ± 30
3 550 ± 20 680 ± 40 680 ± 30
4 800 ± 20 870 ± 40 870 ± 20
5 1050 ± 20 1080 ± 40 1080 ± 20

Stop! Weight a minute!

Query IDTrue Mean Latency / msEstimated Mean Latency / ms
1 56 ± 3 19 ± 4
2 300 ± 10 80 ± 20
3 550 ± 20 150 ± 50
4 800 ± 20 270 ± 100
5 1050 ± 20 500 ± 200

Funky charts!

true
true_obs
elapsed
estimate
1 2 3 4 5

Funky charts!

true
true_obs
elapsed
estimate
1 2 3 4 5

Not so funky chart

Funkier…

Now we’re talking!

Nobody expects vector search!


Query IDTrue Mean Latency /
ms
Vector Search Mean
Latency / ms
Percentage error at
run-level
1 56 ± 3 60 ± 20 10 ± 30
2 300 ± 10 300 ± 40 0 ± 10
3 550 ± 20 550 ± 50 0 ± 9
4 800 ± 20 800 ± 60 0 ± 7
5 1050 ± 20 1060 ± 60 1 ± 5

Thank you! Let’s connect.
Simon Notley
[email protected]
sonotley.uk
Tags