Enhancing P99 Latency: Strategies for Doubling/Tripling Performance in Third-Party APIs

ScyllaDB, Oct 14, 2024

About This Presentation

Sharing our journey to improve P99 latency in third-party APIs. From optimizing network configs to fine-tuning connection management, we aimed to cut down latency and enhance user experience. Dive into our strategies and see how we achieved a smoother, more responsive service. #DevOps #API


Slide Content

A ScyllaDB Community
Enhancing P99 Latency:
Strategies for Doubling/Tripling
Performance in Third-Party APIs
Cristian Velazquez
Staff Software Engineer at Uber

Cristian Velazquez

Staff Software Engineer at Uber
■Efficiency:
●Garbage collection tuning for Java and Go services,
which has saved the company more than $10M
●Distributed load testing
●Metrics emission, disk based cache solutions,
latency tuning
■When I am not at work, I enjoy spending time with my
family and playing video games

Some caveats about the presentation
■In order to protect Uber and our third parties’ privacy, most of the text and
images use anonymized data

What are we going to discuss today?

■Strategies you can use to reduce latency with third-party providers
●Are they useful for local networks? Yes, although the improvements will be smaller,
since latency is lower than when going over the internet
■My learnings during this process of tuning latency

Original setup

■3 Uber core services
■Each communicates with 2-3 different 3rd party providers
●Let's call these providers A, B, and C
■Most calls are served by other Uber services, but a small percentage of the
calls are served by the 3rd party providers
■Uber has 2 major regions where all traffic is served
●Let's call these regions "X" and "Y"

Provider A

■Most of the 3rd party traffic
■Communicating with it using HTTP1
■What was the original issue?

[Latency charts: region X vs. region Y]

Why is there such a difference between regions?

■Most software engineers tend to be surprised about this:
●Geographical location matters:
■Region Y is in the same state as our third party provider
■Region X is hundreds of miles away from our third party provider
[Ping output: region Y vs. region X]

Why is the first request latency 3x that of a normal request?

■TLSv1.2
●A full TLSv1.2 handshake costs two extra round trips before the request can be sent; TLSv1.3 cuts that to one
●Migrate to TLSv1.3 (see the sketch below)
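
The deck itself doesn't include code. As a rough illustration, a minimal Java 11+ HttpClient pinned to TLSv1.3 could look like this sketch (the provider URL is a placeholder):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

public class Tls13Client {
    public static void main(String[] args) throws Exception {
        // Restrict the client to TLSv1.3 so each new connection pays one
        // handshake round trip instead of two.
        SSLContext ctx = SSLContext.getInstance("TLSv1.3");
        ctx.init(null, null, null);
        SSLParameters params = new SSLParameters();
        params.setProtocols(new String[] {"TLSv1.3"});

        HttpClient client = HttpClient.newBuilder()
                .sslContext(ctx)
                .sslParameters(params)
                .build();

        // "providerA.example" is a placeholder, not the real provider.
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://providerA.example/")).build();
        System.out.println(client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}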

How can we reduce how many connections we create?

■Migrate from HTTP1 to HTTP2 (see the sketch below)
●A single HTTP2 connection multiplexes many concurrent requests, so far fewer connections and handshakes are needed

[Chart: connections created, HTTP1 vs. HTTP2]
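
A minimal sketch of the HTTP1-to-HTTP2 change, again assuming the Java 11+ HttpClient (the URL is a placeholder):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class Http2Client {
    public static void main(String[] args) throws Exception {
        // Ask for HTTP2; the client negotiates it via ALPN and multiplexes all
        // concurrent requests over a single TCP+TLS connection.
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        HttpRequest request = HttpRequest.newBuilder(URI.create("https://providerA.example/"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Prints HTTP_2 when the server negotiated it, HTTP_1_1 otherwise.
        System.out.println(response.version() + " " + response.statusCode());
    }
}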

Results provider A

■90% reduction in new connections created
■50% reduction in connection latency
■10% improvement in p95 latency (including reading the response)

Provider B

■Very low traffic per instance (though globally it is significant)
■Communicating with it using HTTP1
■What was the original issue?
●As with provider A, geographical location matters, so one zone was experiencing very
bad latency, but the delta here was higher at p99 (250-350ms)

Issues experienced with this provider

■No HTTP2 support
■No TLSv1.3 support
■What can we do?

Let's understand the latency
curl -w "@/tmp/format.txt" -o /dev/null -s --tlsv1.2 --tls-max 1.2 'https://providerB'
■A good amount of latency comes from the DNS lookup, but it only happens on the first request (the write-out template is sketched after the table)

Phase Time from start
time_namelookup 0.256093s
time_connect 0.296952s
time_appconnect 0.391648s
time_pretransfer 0.391719s
time_starttransfer 0.713931s
time_total 0.714028s
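
The deck doesn't show the contents of /tmp/format.txt; a curl --write-out template that produces the fields above would look roughly like this (all variables are standard curl write-out fields):

     time_namelookup: %{time_namelookup}\n
        time_connect: %{time_connect}\n
     time_appconnect: %{time_appconnect}\n
    time_pretransfer: %{time_pretransfer}\n
  time_starttransfer: %{time_starttransfer}\n
          time_total: %{time_total}\n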

DNS cache?


■We were already using it
●The problem is that traffic per instance is so low that sometimes the cache entry expires
■Should we increase the TTL?

Solution?
■Async DNS. The previous client did DNS resolution synchronously; now
a background thread does it (see the sketch below)
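
A minimal sketch of the idea in Java (the class name and refresh interval are made up): a scheduled task keeps the resolved addresses warm so request threads never block on DNS.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class AsyncDnsCache {
    private final AtomicReference<InetAddress[]> cached = new AtomicReference<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public AsyncDnsCache(String host, long refreshSeconds) {
        Runnable refresh = () -> {
            try {
                cached.set(InetAddress.getAllByName(host));
            } catch (UnknownHostException e) {
                // Keep the last known good addresses if a refresh fails.
            }
        };
        refresh.run(); // resolve once at startup, on the caller's thread
        scheduler.scheduleAtFixedRate(refresh, refreshSeconds, refreshSeconds, TimeUnit.SECONDS);
    }

    // Request threads read the cached result instead of resolving synchronously.
    public InetAddress[] addresses() {
        return cached.get();
    }
}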

How can we reduce the impact of connection latency?

■Remember: no TLSv1.3 or HTTP2 support. So why are we creating so many
connections?
●Answer: the server keepalive is 4 seconds

Solution?
■Empty request to keep the connection alive (every 1.5s); see the sketch below
●This request goes to a very fast endpoint, which reduces the chance of blocking a real
production request
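
A minimal sketch of the keepalive pinger, assuming Java and a hypothetical /status path on the provider; the point is to reuse the production client's connection pool and touch it more often than the server's 4-second keepalive:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class KeepalivePinger {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(HttpClient sharedClient) {
        // "/status" stands in for the provider's very fast endpoint.
        HttpRequest ping = HttpRequest.newBuilder(URI.create("https://providerB.example/status"))
                .timeout(Duration.ofMillis(500))
                .GET()
                .build();
        // Fire every 1.5s, well inside the 4s server keepalive, so real
        // requests always find a warm, already-established connection.
        scheduler.scheduleAtFixedRate(
                () -> sharedClient.sendAsync(ping, HttpResponse.BodyHandlers.discarding()),
                0, 1500, TimeUnit.MILLISECONDS);
    }
}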

Results provider B
■P99 from >600ms to
250-300ms
■P50 from 400ms to
150ms

Challenges

■Java's HTTP2, TLSv1.3, and async DNS support was not great
●Use a sidecar proxy server for outbound requests
■Good:
●More metrics to debug connection issues
■"Bad":
●Need a fallback in case the sidecar crashes/fails (see the sketch below)
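
A rough sketch of how the fallback might be wired, assuming two Java HttpClient instances: one routed through the local sidecar proxy and one talking to the provider directly (names are made up):

import java.io.IOException;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OutboundWithFallback {
    private final HttpClient viaSidecar; // configured to proxy through the local sidecar
    private final HttpClient direct;     // plain client, talks to the provider directly

    public OutboundWithFallback(HttpClient viaSidecar, HttpClient direct) {
        this.viaSidecar = viaSidecar;
        this.direct = direct;
    }

    public HttpResponse<String> send(HttpRequest request) throws IOException, InterruptedException {
        try {
            return viaSidecar.send(request, HttpResponse.BodyHandlers.ofString());
        } catch (IOException sidecarDown) {
            // If the sidecar crashes or is unreachable, degrade to a direct
            // connection instead of failing the request outright.
            return direct.send(request, HttpResponse.BodyHandlers.ofString());
        }
    }
}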

Bonus: provider A

■While onboarding a new zone, we noticed 8x latency compared to other zones
●The 3rd party was routing us to Europe
■Tooling that continuously monitors route paths to providers (see the sketch below)
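
The deck doesn't describe that tooling; one simple way to approximate it is to trace the route to the provider on a schedule and alert when it changes (the hostname and interval below are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RouteMonitor {
    public static void main(String[] args) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                // -n skips reverse DNS so runs are fast and comparable.
                Process p = new ProcessBuilder("traceroute", "-n", "providerA.example")
                        .redirectErrorStream(true)
                        .start();
                try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                    // Real tooling would diff this output against the previous
                    // run and feed changes into alerting.
                    r.lines().forEach(System.out::println);
                }
                p.waitFor();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 5, TimeUnit.MINUTES);
    }
}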

Thank you! Let’s connect.
Cristian Velazquez
linkedin.com/in/cdvr1993
GOGCTuner Blog
Ballast blog
Shadower blog
Presto GC tuning