Queues, Hockey Sticks and Performance by David Collier-Brown

ScyllaDB, Oct 15, 2024

About This Presentation

Queues: both a blessing and a curse in computer science. They help predict performance but also signal overload. This talk explores their role in diagnosis, capacity planning, and development using physics concepts and the "hockey-stick" curve. Master queue intuition for better programs.


Slide Content

A ScyllaDB Community
Queues, Hockey Sticks and
Performance
David Collier-Brown

No, Not the Tim Hortons Kind!

This Kind
Two graphs from a textbook:
■The upper one is throughput
■The lower is slowdown under load
(aka queue delay)
−You’ve probably plotted the top
one from a load test
■Both are really hockey-sticks, but the
top one is upside-down

Our Graphs
■They can be computed from one
another
■If you measure response time, you
can build a little mathematical model,
like the one I used to draw these
■And the latter is easy to draw
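
In code, the “little mathematical model” is tiny. A minimal sketch, assuming a single-server M/M/1-style queue; the 10 ms service time is an invented example, and slowdown() is just the textbook R/S = 1/(1 - U):

// hockeystick.go: compute the slowdown curve from a measured service time.
package main

import "fmt"

// slowdown returns R/S = 1/(1-U) for a single-server queue.
// It blows up as utilization U approaches 1: the handle of the hockey stick.
func slowdown(u float64) float64 {
	return 1.0 / (1.0 - u)
}

func main() {
	const s = 0.010 // assumed service time: 10 ms
	for _, u := range []float64{0.1, 0.5, 0.8, 0.9, 0.95, 0.99} {
		r := s * slowdown(u) // response time
		x := u / s           // throughput, from the utilization law U = X*S
		fmt.Printf("U=%.2f  X=%4.0f/s  R=%6.1f ms  slowdown=%5.1fx\n",
			u, x, r*1000, slowdown(u))
	}
}

Plot slowdown against utilization and you get the lower hockey stick; plot throughput against offered load and you get the upper, upside-down one.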

Part 1. Slowdowns in a Benchmark
■That’s a whole collection of hockey-sticks
●Look at the dark-red one, for example
■This is not a nice result

What Did We Expect?

■A flat line around 1
■Rising to 2 at quite a high load

■This is an increase in response
time
■The increase tells us work is
stuck in a queue

What Did We Just See?

The DBA Asked for a CPU Chart

■He’d noticed “DB Writer”
slowing down
●That should never happen
■DB Writer (black) is a critical part of
the database: it updates the disk
■Middleware (yellow), on the other
hand, grows without bound

Middleware vs DB Writer
■Middleware just keeps going
up
■DB writer heads down
Load          1500     2250     3000     3750
Middleware    7.42%   11.91%   26.38%   31.73%
DB Writer     0.43%    0.67%    0.67%    0.42%

Fixed It!

■We gave DB Writer
guaranteed CPU
■We also doubled the
number of CPU
cores (We had run
out of CPU, too (:-))

Part 2. Why Does it Happen?

■Because I have more work
than CPUs
■This is what that causes
●the Y axis is queue delay
●the X axis is the start times of the transactions
●Units are tenths of a second

Why Does it Happen II

■Transaction 2 isn’t done when
3 arrives
■Three has to wait
■Ditto 4 and 5

The Line of Green Boxes Created the Handle

■The horizontal line is the initial
service time
■The diagonal one is the delay we get
from not enough resources
■And the curve between them is from
probability
●The busier we are, the higher probability a
transaction will have to wait
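
A toy single-server simulation reproduces the green boxes. The arrival spacing and the fixed service time below are invented for the illustration; the point is that once arrivals outrun the server, each transaction inherits the previous one’s leftover work:

// queuewait.go: each new transaction waits for the one before it.
package main

import "fmt"

func main() {
	const service = 1.0 // fixed service time, in tenths of a second as on the slide
	// Arrivals come faster than one per service time, so work piles up.
	arrivals := []float64{0.0, 0.7, 1.4, 2.1, 2.8}

	var prevDone float64
	for i, a := range arrivals {
		start := a
		if prevDone > a { // the previous transaction isn't finished yet
			start = prevDone
		}
		wait := start - a
		prevDone = start + service
		fmt.Printf("txn %d: arrives %.1f, waits %.1f, finishes %.1f\n",
			i+1, a, wait, prevDone)
	}
}

The waits come out as 0.0, 0.3, 0.6, 0.9, 1.2: the flat service time is the blade, the steadily growing wait is the handle.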

Why and What For

■We use the slowdown curve in
at least four areas
1) capacity planning
2) diagnosis
3) development and
4) repair

Part 3. Capacity Planning
■The risk is of over-
or under-buying
●Over-buying
wastes money
●Under-buying
causes a
business failure

Capacity Planning, Ctd
■We need to “just
stay ahead of
demand”
●Marketing does
the estimate
●Ops buys
enough
machines
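
In code, “just stay ahead of demand” is one application of the utilization law. A sketch with invented numbers: Marketing’s forecast, what one core sustained in a load test, and the roughly-80% ceiling that comes up later in this talk:

// sizing.go: turn a demand forecast into a purchase order.
package main

import (
	"fmt"
	"math"
)

// coresNeeded sizes the fleet so forecast demand lands at or below
// the target utilization on the cores we buy.
func coresNeeded(demand, perCore, maxUtil float64) int {
	return int(math.Ceil(demand / (perCore * maxUtil)))
}

func main() {
	forecast := []float64{1500, 2250, 3000, 3750} // requests/s per quarter (invented)
	const perCore = 250.0                         // requests/s one core sustained in a load test (invented)
	const maxUtil = 0.80                          // stay off the steep part of the hockey stick

	for i, d := range forecast {
		fmt.Printf("quarter %d: %4.0f req/s needs %d cores\n",
			i+1, d, coresNeeded(d, perCore, maxUtil))
	}
}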

Part 4. Diagnosis
Another slowness graph
■The red line is the
new machine
■The blue one is the
old
●Something’s wrong: the new one is slower

Old Versus New

■Old was reusing
established
connections,
saving lots of time
■New was not
●It was mis-set to
use HTTP 1.1
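
In Go, for instance, connection reuse mostly comes down to two habits: share one http.Client and drain and close every response body. A sketch of the general technique, not the original system’s configuration:

// reuse.go: let the client keep and reuse established connections.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// One shared client for the whole program: it keeps a pool of
// keep-alive connections. A new client per request throws that pool
// away and pays connection setup every time.
var client = &http.Client{Timeout: 5 * time.Second}

func fetch(url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the Transport can put the connection back
	// in the pool for reuse.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	for i := 0; i < 3; i++ {
		if err := fetch("https://example.com/"); err != nil {
			fmt.Println("request failed:", err)
		}
	}
}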

Part 5. Bottleneck-Hunting

■Process 1 is the bottleneck
■What happens if we fix it?
●The performance almost
doubles

Where Bottleneck Removal Doesn’t Work

■If we fix process 1,
■We just bottleneck
on process 2
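
A pipeline runs at the speed of its slowest stage, which is why the two cases differ. A sketch with invented stage capacities:

// bottleneck.go: fixing a bottleneck helps only until the next one.
package main

import "fmt"

// systemThroughput: the pipeline only goes as fast as its slowest stage.
func systemThroughput(stages []float64) float64 {
	slowest := stages[0]
	for _, s := range stages[1:] {
		if s < slowest {
			slowest = s
		}
	}
	return slowest
}

func main() {
	// Case 1: process 1 is much slower than the rest, so fixing it
	// almost doubles the whole system.
	fmt.Println(systemThroughput([]float64{100, 190, 500}), "->",
		systemThroughput([]float64{300, 190, 500}))

	// Case 2: process 2 is nearly as slow, so the bottleneck just moves.
	fmt.Println(systemThroughput([]float64{100, 110, 500}), "->",
		systemThroughput([]float64{300, 110, 500}))
}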

Part 6. Development
Too many cats, all wanting to be fed right now

If you’re writing the program and it uses HTTP(S),
then return status 429:
■Tells the client to slow down
●Browsers report it to the user, who can retry
●Various retry packages will resubmit automatically, e.g. golang’s retryafter
■Is part of HTTP, and 429 is a 4XX code clients are expected to retry
■It forces the client to take the time to re-send, even if the client would
like to ignore it and proceed immediately

What this sends

■429 means “wait”
■Retry-After: 3600 means
“after an hour”
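
A minimal Go sketch of that advice. The overloaded() check is a hypothetical placeholder for whatever load signal you actually have, and the one-hour Retry-After is just the slide’s example:

// throttle.go: answer 429 plus Retry-After when the server is overloaded.
package main

import (
	"log"
	"net/http"
)

// overloaded is a placeholder: wire it to your real load signal
// (queue depth, in-flight count, utilization).
func overloaded() bool {
	return false
}

func handler(w http.ResponseWriter, r *http.Request) {
	if overloaded() {
		w.Header().Set("Retry-After", "3600") // "after an hour", as on the slide
		http.Error(w, "too many requests", http.StatusTooManyRequests)
		return
	}
	w.Write([]byte("ok\n"))
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}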

Part 7. Controlling Demand
■Control the sender
●TCP/IP does exactly
that
■When someone sends too
rapidly, they don’t get a
go-ahead from the recipient
●They are delayed,
causing them to slow
down. They do so,
then gradually speed
up until they are
slowed once again
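
The same “pace the sender” idea can be borrowed at the application level. A sketch using golang.org/x/time/rate; the 100-per-second limit is an arbitrary number, and this is an analogue of TCP’s behaviour rather than TCP itself:

// pace.go: the sender blocks until it is allowed to proceed.
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// Allow 100 sends per second with a small burst.
	limiter := rate.NewLimiter(rate.Limit(100), 10)

	for i := 0; i < 5; i++ {
		// Wait delays the sender when it gets ahead of the limit,
		// the way TCP delays a sender that has no window left.
		if err := limiter.Wait(context.Background()); err != nil {
			fmt.Println("limiter:", err)
			return
		}
		fmt.Println("send", i)
	}
}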

When They Cheat
■“Bufferbloat” is excessive
buffering, trying to get more of
the channel than is fair
●The shape of those curves should
be familiar (:-))

How? Controlling Demand
Is the same as managing your
unread books
■First, capture their credit card
■If that doesn’t work, smash
their internet connection
■That’s what CAKE and a
program called LibreQoS do,
less violently

For One Thing, Signal Sooner
■Send “slow down”
headers before
stopping the
acknowledgments
■Stay safely below the
maximum
throughput (around
80% utilization)

CAKE does this for home routers
LibreQoS does it for entire ISPs
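
CAKE and LibreQoS act on packets; inside a service, the same “signal sooner” idea can be sketched as a wrapper that starts answering 429 at about 80% of capacity. The capacity figure and the in-flight counter here are assumptions made for the illustration:

// shed.go: send "slow down" while there is still headroom.
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

const capacity = 100                       // concurrent requests we can really handle (assumed)
const threshold = int64(capacity * 8 / 10) // start shedding at about 80% of that

var inFlight int64

// shedding rejects excess work early, so clients back off before the
// queue, and the response time, explode.
func shedding(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&inFlight, 1) > threshold {
			atomic.AddInt64(&inFlight, -1)
			w.Header().Set("Retry-After", "1")
			http.Error(w, "slow down", http.StatusTooManyRequests)
			return
		}
		defer atomic.AddInt64(&inFlight, -1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	work := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", shedding(work)))
}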

And That’s It
■You now know everything I’ve learned about queues in the last ten
years (:-))
■Go fix something!

References
■LibreQoS – libreqos.io
■“You Don’t Know Jack” articles (part of a series)
−Application Performance, about queues, https://dl.acm.org/doi/10.1145/3595862
−Bandwidth, about LibreQoS and TCP/IP, https://dl.acm.org/doi/10.1145/3674953
■Two books from my favourite mathie, Neil J. Gunther, at http://www.perfdynamics.com/
−Analyzing Computer System Performance with Perl::PDQ
−Guerrilla Capacity Planning
■Bufferbloat article, https://dl.acm.org/doi/pdf/10.1145/2063166.2071893
■TeamQuest Predictor, https://www.fortra.com/resources/datasheets/vcm-enterprise

Thank you! Let’s connect.
David Collier-Brown
[email protected]
https://hachyderm.io/@davecb (Mastodon)
https://leafless.ca