The Latency Stack: Discovering Surprising Sources of Latency

ScyllaDB · Jun 27, 2024

About This Presentation

Usually, when an API call is slow, we developers blame ourselves and our code. We held a lock too long, used a blocking operation, or built an inefficient query. But often, the simple picture of latency as “the time a server takes to process a message” hides a great deal of end-to-end complexity...


Slide Content

The Latency Stack: Discovering Surprising Sources of Latency. Mark Gritter, Principal Engineer at Postman.

Mark Gritter (he/him), Principal Engineer at Postman. Working on monitoring and observability solutions for APIs at Akita Software, now Postman. Previously built VM-aware flash storage arrays at Tintri. Hobbies: gardening, weaving, math.

Where does latency come from?

A simple picture of the world: latency is the time from the start to the end of func(w http.ResponseWriter, r *http.Request) { }. Did I wait for a lock? Are my queries slow? Did I miss in cache?
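
To make that “simple picture” concrete, here is a minimal Go sketch that measures only the time spent inside the handler; the middleware name, route, and port are illustrative, not taken from the talk.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// withLatency measures the "simple picture" of latency: the time between
// entering and leaving the handler. Everything outside these two points
// (queues, kernel, network, client) is invisible to it.
func withLatency(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		log.Printf("%s %s took %v", r.Method, r.URL.Path, time.Since(start))
	}
}

func main() {
	http.HandleFunc("/hello", withLatency(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```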

A more realistic picture of the world: between start and end, the handler calls other services (myService.call(); ret, err := ...), and that time is part of the latency too. Did I make unnecessary round trips? Did I send too much data?

But the more we look the worse it gets… client application, server application, client OS, client hypervisor, server OS, server hypervisor, DNS server, TCP/IP, TLS, load balancers, firewalls, service meshes.

Understanding latency requires opening up our abstractions!

Example: Video Preprocessing @ Kealia. One of my first jobs was at a streaming video startup. Custom hardware shipped video out; it needed 1 MB blocks. My software had to preprocess MPEG; we wanted to handle live TV as well as ingested movies. 3 Mbit/s standard definition = one block every 2.7 seconds, delivered over TCP to a storage device. Not a highly demanding schedule… but once in a while we were several seconds late.
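
As a quick sanity check of that arithmetic (assuming decimal megabytes and megabits, which is what the slide’s 2.7-second figure implies), a tiny Go calculation:

```go
package main

import "fmt"

func main() {
	// A 1 MB block fed by a 3 Mbit/s standard-definition stream:
	const blockBits = 1_000_000 * 8  // 1 MB expressed in bits
	const rateBitsPerSec = 3_000_000 // 3 Mbit/s
	fmt.Printf("one block every %.1f seconds\n", float64(blockBits)/rateBitsPerSec)
	// Prints: one block every 2.7 seconds
}
```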

It must be my software’s fault. Latency = from video arriving at the preprocessor to the block landing in disk storage. We looked at all the usual things, and some unusual ones: inefficient algorithm? Lock contention? Bad scheduling? Memory allocation? We even built a custom tracing tool so we could see what each thread was doing!

Is it the network? We were all networking nerds… so we of course looked at a tcpdump trace on the receiver and saw that the late-arriving blocks were transmitted with acceptably low latency. They just started late…?

Is it the network? (2) But finally we looked at the sending side, and caught this: the block was ready on time. We opened up a new TCP connection to send it, and the initial SYN got lost. And TCP’s retry timeout for an unacknowledged SYN is 5 seconds?!?

Confession Time. We could have improved our TCP connection pooling, but because we were compiling our own kernel anyway, I just reduced the retry to a few hundred milliseconds. Lesson learned: default protocol parameters cause latency.
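
The actual fix was a kernel change; a more portable sketch of the same idea is to bound the connect attempt at the application level, so a lost SYN costs a few hundred milliseconds before a retry rather than the kernel’s multi-second retransmission timer. The address and timeout below are assumptions for illustration.

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Cap how long a single connection attempt may take. If the initial SYN
	// is lost, we give up after 300 ms and retry on a fresh socket instead
	// of waiting out the kernel's SYN-retransmission timer.
	d := net.Dialer{Timeout: 300 * time.Millisecond}
	conn, err := d.Dial("tcp", "storage.internal:9000") // hypothetical storage endpoint
	if err != nil {
		log.Printf("connect attempt timed out, retrying: %v", err)
		return
	}
	defer conn.Close()
	log.Printf("connected to %v", conn.RemoteAddr())
}
```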

It’s Always Storage’s Fault

Example: Flash storage @ Tintri. One of my more recent jobs was at a flash storage startup. A potential customer using VMware was seeing high I/O latency. Obviously they needed to upgrade their storage! So they rolled in a fancy all-flash storage array and… nothing improved. That gave us a shot at the problem.

The Wrong Picture: the VM sends a request; storage accesses the SSD; the VM receives the response. Latency = network delay + controller software + device read.

The Closer-to-Correct Picture: the VM sends a request; storage accesses the SSD; the hypervisor receives the response; the VM is scheduled to run and finally receives it. Latency = network delay + controller software + device read + time the VM isn’t running.

Hypervisor delays. Why might the hypervisor be slow? Swapping (sometimes), and all of a virtual machine’s CPUs have to be scheduled together, so more CPUs = harder to schedule! Hypervisor: “I have 8 physical CPUs, one of them busy.” VM: “I need 8 virtual CPUs.” Hypervisor: “lol, stby.”

Measuring what matters. Our strategy: measure latency locally (in our case, for each VM); estimate network latency using inter-arrival times; ask the hypervisor for its measurement; then “host” latency = hypervisor - storage - network. Grab the appropriate hypervisor stats to shift blame away from storage. Weeks of finger-pointing solved! Lesson learned: latency is the result of having too many CPUs.
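
A minimal sketch of that attribution arithmetic (field names and numbers are illustrative; this is not Tintri’s actual code):

```go
package main

import "fmt"

// latencyBreakdown attributes end-to-end latency: whatever the hypervisor
// reports that the storage and network measurements don't account for is
// charged to the "host".
type latencyBreakdown struct {
	HypervisorMs float64 // per-VM latency reported by the hypervisor
	StorageMs    float64 // latency measured locally on the array, per VM
	NetworkMs    float64 // network latency estimated from inter-arrival times
}

func (b latencyBreakdown) HostMs() float64 {
	return b.HypervisorMs - b.StorageMs - b.NetworkMs
}

func main() {
	b := latencyBreakdown{HypervisorMs: 42.0, StorageMs: 1.5, NetworkMs: 0.5}
	fmt.Printf("host latency: %.1f ms\n", b.HostMs()) // 40.0 ms spent outside the array
}
```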

When the Network is not a Network

Example: Autosupports @ Tintri. Our flash storage arrays uploaded “autosupport reports” periodically. The back-end system thought the upload succeeded. The storage array thought it did not. And uploaded it again. And again. And again.

It’s just HTTP POST! How can the client and the server disagree so badly – and so consistently – about whether the upload succeeded or not? The upload had a checksum (sent in a header), so we knew the data was not corrupt or truncated. Logging and debugging in our software showed nothing particularly unusual – the client was “just” getting some weird TLS error.
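
The checksum-in-a-header pattern looks roughly like the sketch below; the header name, URL parameter, and choice of SHA-256 are assumptions for illustration, not the actual autosupport protocol.

```go
package autosupport

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"net/http"
)

// uploadWithChecksum POSTs a report and includes a checksum of the body in a
// header, so the receiver can detect corruption or truncation independently
// of whether the transfer "succeeded" at the transport level.
func uploadWithChecksum(client *http.Client, url string, body []byte) (*http.Response, error) {
	sum := sha256.Sum256(body)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Content-Checksum", hex.EncodeToString(sum[:])) // hypothetical header name
	return client.Do(req)
}
```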

Is it the network? I was still a networking nerd… so we of course looked at a tcpdump trace on both the sender and the receiver (because I had learned). (Trace diagram: a FIN and a RST exchanged between the server and the storage array.)

Is it the network? (2) But that picture was a lie. Once I matched up the timestamps between the two captures… (same FIN/RST diagram, redrawn with the clocks aligned).

The Closer-to-Correct Picture: the array sends data quickly to the ELB, but the ELB sends data slowly on to the server. Network latency is negative: the client sees an (error) response before the server finishes! The ELB closes the “idle” connection… but keeps sending data anyway, and the server eventually answers HTTP 200 OK to a client that has already given up.

Mitigation: N=1, so we removed the ELB, but we also fixed the underlying problem: don’t read from a network stream 1 byte at a time. Why do ELB load balancers have tens of megabytes of buffering? (Do they still? I don’t know.) Lesson learned: AWS services cause latency.
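
A sketch of the “don’t read 1 byte at a time” fix in Go: the slow loop below issues one tiny read per byte, while the buffered version drains the socket in large chunks so data does not pile up in an intermediary’s buffers. The 64 KB buffer size is an arbitrary illustrative choice.

```go
package upload

import (
	"bufio"
	"io"
	"net"
)

// readSlowly is the anti-pattern: one read call (and one tiny socket read)
// per byte of the request body.
func readSlowly(conn net.Conn) ([]byte, error) {
	var out []byte
	one := make([]byte, 1)
	for {
		n, err := conn.Read(one)
		out = append(out, one[:n]...)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
	}
}

// readQuickly wraps the connection in a buffered reader so the socket is
// drained in large chunks instead of byte by byte.
func readQuickly(conn net.Conn) ([]byte, error) {
	return io.ReadAll(bufio.NewReaderSize(conn, 64*1024))
}
```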

It’s always DNS

We’ve known it’s DNS since the early 2000s. “The Contribution of DNS Lookup Costs to Web Object Retrieval” by Craig E. Wills and Hao Shang, 2000, WPI-CS-TR-00-12: “Finally we found that the DNS lookup time contributed more than one second to approximately 20% of retrievals for the Web objects on the home page of a larger list of popular servers.” DNS is not normally slow, but when your HTTP request is slow… it’s probably DNS, particularly if you are still using a 1-second retry time.
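
One way to see this in practice is to time the DNS lookup on its own, with its own deadline, so a slow or retried query shows up as “DNS was slow” rather than “the API was slow”. The hostname and timeout in this Go sketch are illustrative.

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

func main() {
	// Give the lookup its own deadline and measure it separately from the
	// HTTP request that follows it.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	start := time.Now()
	addrs, err := net.DefaultResolver.LookupHost(ctx, "example.com")
	log.Printf("DNS lookup took %v (addrs=%v, err=%v)", time.Since(start), addrs, err)
}
```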

Techniques

The Latency Stack: routing in the network; your server’s physical network hardware; your server’s hypervisor; your server’s network protocols; your server’s operating system; your server’s CPU and memory subsystems; your application’s queues; your code; your application’s dependencies; the client software; the client’s DNS; the client’s operating system; the client’s network protocols; the client’s hypervisor; the client’s physical network hardware; queuing in the network; middleboxes in the network.

Start by looking at both sides. Compare client and server measurements and highlight discrepancies: failure and error rates, duration and latency. Extend the scope of spans to cover more of the network and kernel: server latency, not just application latency; eBPF probes to examine kernel queuing; network-based measurements to get the “ground truth”. Draw the whole picture: start including pieces you otherwise abstract away.
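
As a toy illustration of the “compare both sides” step, the sketch below flags requests whose client-observed duration greatly exceeds the server-observed duration; the threshold and the numbers are made up.

```go
package main

import (
	"fmt"
	"time"
)

// compareSpans highlights the gap between what the client experienced and
// what the server's span recorded: that gap lives somewhere in the stack
// between them (DNS, kernel queues, load balancers, scheduling, ...).
func compareSpans(clientDur, serverDur time.Duration) {
	if gap := clientDur - serverDur; gap > 10*time.Millisecond {
		fmt.Printf("client saw %v, server saw %v: %v unaccounted for\n", clientDur, serverDur, gap)
	}
}

func main() {
	compareSpans(1200*time.Millisecond, 35*time.Millisecond)
}
```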

Measuring Latency from a Network Trace. With A = first packet of the request, B = last packet of the request, C = first packet of the response, and D = last packet of the response: C - B = processing latency, and D - A = a lower bound on end-to-end latency. If you’re really ambitious, RTT estimates can be derived too!
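
A small Go illustration of those definitions, with made-up packet timestamps:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps pulled from a packet capture (values are made up):
	//   A = first packet of the request, B = last packet of the request,
	//   C = first packet of the response, D = last packet of the response.
	base := time.Now()
	A := base
	B := base.Add(2 * time.Millisecond)
	C := base.Add(47 * time.Millisecond)
	D := base.Add(51 * time.Millisecond)

	fmt.Println("processing latency (C - B):", C.Sub(B))     // time the server held the full request
	fmt.Println("end-to-end lower bound (D - A):", D.Sub(A)) // at least this much as seen near the client
}
```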

When we’re examining tail latencies, by definition we’re looking at something unusual!

Mark Gritter [email protected] @[email protected] Thank you! Let’s connect.