Enhancing P99 Latency: Strategies for Doubling/Tripling Performance in Third-Party APIs

ScyllaDB, Oct 14, 2024

About This Presentation

Sharing our journey to improve P99 latency in third-party APIs. From optimizing network configs to fine-tuning connection management, we aimed to cut down latency and enhance user experience. Dive into our strategies and see how we achieved a smoother, more responsive service. #DevOps #API


Slide Content

A ScyllaDB Community
Enhancing P99 Latency:
Strategies for Doubling/Tripling
Performance in Third-Party APIs
Cristian Velazquez
Staff Software Engineer at Uber

Cristian Velazquez

Staff Software Engineer at Uber
■Efficiency:
●Garbage collection tuning for Java and Go services,
which has saved the company more than $10M
●Distributed load testing
●Metrics emission, disk based cache solutions,
latency tuning
■When I am not at work, I enjoy spending time with my
family and playing video games

Some caveats about the presentation
■In order to protect Uber and our third parties’ privacy, most of the text and
images use anonymized data

What are we going to discuss today?

■Strategies you can use to reduce latency with third-party providers
●Are they useful for local networks? Yes, although the improvements will be smaller,
since latency is lower than when going over the internet
■My learnings during this process of tuning latency

Original setup

■3 Uber core services
■Each communicates with 2-3 different 3rd party providers
●Let's call these providers A, B, and C
■Most calls are served by other Uber services, but a small percentage of the
calls are served by the 3rd party providers
■Uber has 2 major regions where all traffic is served
●Let's call these regions "X" and "Y"

Provider A

■Most of the 3rd party traffic
■Communicating with it using HTTP1
■What was the original issue?

[Latency charts: region X vs. region Y]

Why is there such a difference between regions?

■Most software engineers tend to be surprised about this:
●Geographical location matters:
■Region Y is in the same state as our third party provider
■Region X is hundreds of miles away from our third party provider
[Ping output: region Y vs. region X]

Why is the first request latency 3x that of a normal request?

■TLSv1.2
●A full TLSv1.2 handshake costs two extra round trips before the request can be sent; TLSv1.3 cuts that to one
●Migrate to TLSv1.3 (see the sketch below)
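
The deck itself doesn't include code. As a rough illustration, a minimal Java 11+ HttpClient pinned to TLSv1.3 could look like this sketch (the provider URL is a placeholder):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

public class Tls13Client {
    public static void main(String[] args) throws Exception {
        // Restrict the client to TLSv1.3 so each new connection pays one
        // handshake round trip instead of two.
        SSLContext ctx = SSLContext.getInstance("TLSv1.3");
        ctx.init(null, null, null);
        SSLParameters params = new SSLParameters();
        params.setProtocols(new String[] {"TLSv1.3"});

        HttpClient client = HttpClient.newBuilder()
                .sslContext(ctx)
                .sslParameters(params)
                .build();

        // "providerA.example" is a placeholder, not the real provider.
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://providerA.example/")).build();
        System.out.println(client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}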

How can we reduce how many connections we create?

■Migrate from HTTP1 to HTTP2 (see the sketch below)
●A single HTTP2 connection multiplexes many concurrent requests, so far fewer connections and handshakes are needed

[Chart: connections created, HTTP1 vs. HTTP2]
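
A minimal sketch of the HTTP1-to-HTTP2 change, again assuming the Java 11+ HttpClient (the URL is a placeholder):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class Http2Client {
    public static void main(String[] args) throws Exception {
        // Ask for HTTP2; the client negotiates it via ALPN and multiplexes all
        // concurrent requests over a single TCP+TLS connection.
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        HttpRequest request = HttpRequest.newBuilder(URI.create("https://providerA.example/"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Prints HTTP_2 when the server negotiated it, HTTP_1_1 otherwise.
        System.out.println(response.version() + " " + response.statusCode());
    }
}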

Results provider A

■90% reduction in new connections created
■50% reduction in connection latency
■10% improvement in p95 latency (including reading the response)

Provider B

■Very low traffic per instance (though globally it is significant)
■Communicating with it using HTTP1
■What was the original issue?
●As with provider A, geographical location matters, so one zone was experiencing very
bad latency, but the delta here was higher at p99 (250-350ms)

Issues experienced with this provider

■No HTTP2 support
■No TLSv1.3 support
■What can we do?

Let's understand the latency
curl -w "@/tmp/format.txt" -o /dev/null -s --tlsv1.2 --tls-max 1.2 'https://providerB'
■A good amount of latency comes from the DNS lookup, but it only happens on the first request (the write-out template is sketched after the table)

Phase Time from start
time_namelookup 0.256093s
time_connect 0.296952s
time_appconnect 0.391648s
time_pretransfer 0.391719s
time_starttransfer 0.713931s
time_total 0.714028s
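
The deck doesn't show the contents of /tmp/format.txt; a curl --write-out template that produces the fields above would look roughly like this (all variables are standard curl write-out fields):

     time_namelookup: %{time_namelookup}\n
        time_connect: %{time_connect}\n
     time_appconnect: %{time_appconnect}\n
    time_pretransfer: %{time_pretransfer}\n
  time_starttransfer: %{time_starttransfer}\n
          time_total: %{time_total}\n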

DNS cache?


■We were already using it
●The problem is that traffic per instance is so low that sometimes the cache entry expires
■Should we increase the TTL?

Solution?
■Async DNS. The previous client did DNS resolution synchronously; now
a background thread does it (see the sketch below)
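
A minimal sketch of the idea in Java (the class name and refresh interval are made up): a scheduled task keeps the resolved addresses warm so request threads never block on DNS.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class AsyncDnsCache {
    private final AtomicReference<InetAddress[]> cached = new AtomicReference<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public AsyncDnsCache(String host, long refreshSeconds) {
        Runnable refresh = () -> {
            try {
                cached.set(InetAddress.getAllByName(host));
            } catch (UnknownHostException e) {
                // Keep the last known good addresses if a refresh fails.
            }
        };
        refresh.run(); // resolve once at startup, on the caller's thread
        scheduler.scheduleAtFixedRate(refresh, refreshSeconds, refreshSeconds, TimeUnit.SECONDS);
    }

    // Request threads read the cached result instead of resolving synchronously.
    public InetAddress[] addresses() {
        return cached.get();
    }
}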

How can we reduce the impact of connection latency?

■Remember: no TLSv1.3 or HTTP2 support. So why are we creating so many
connections?
●Answer: the server keepalive is 4 seconds

Solution?
■Empty request to keep the connection alive (every 1.5s); see the sketch below
●This request goes to a very fast endpoint, which reduces the chance of blocking a real
production request
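
A minimal sketch of the keepalive pinger, assuming Java and a hypothetical /status path on the provider; the point is to reuse the production client's connection pool and touch it more often than the server's 4-second keepalive:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class KeepalivePinger {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(HttpClient sharedClient) {
        // "/status" stands in for the provider's very fast endpoint.
        HttpRequest ping = HttpRequest.newBuilder(URI.create("https://providerB.example/status"))
                .timeout(Duration.ofMillis(500))
                .GET()
                .build();
        // Fire every 1.5s, well inside the 4s server keepalive, so real
        // requests always find a warm, already-established connection.
        scheduler.scheduleAtFixedRate(
                () -> sharedClient.sendAsync(ping, HttpResponse.BodyHandlers.discarding()),
                0, 1500, TimeUnit.MILLISECONDS);
    }
}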

Results provider B
■P99 from >600ms to
250-300ms
■P50 from 400ms to
150ms

Challenges

■Java's HTTP2, TLSv1.3, and async DNS support was not great
●Use a sidecar proxy server for outbound requests
■Good:
●More metrics to debug connection issues
■"Bad":
●Need a fallback in case the sidecar crashes/fails (see the sketch below)
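
A rough sketch of how the fallback might be wired, assuming two Java HttpClient instances: one routed through the local sidecar proxy and one talking to the provider directly (names are made up):

import java.io.IOException;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OutboundWithFallback {
    private final HttpClient viaSidecar; // configured to proxy through the local sidecar
    private final HttpClient direct;     // plain client, talks to the provider directly

    public OutboundWithFallback(HttpClient viaSidecar, HttpClient direct) {
        this.viaSidecar = viaSidecar;
        this.direct = direct;
    }

    public HttpResponse<String> send(HttpRequest request) throws IOException, InterruptedException {
        try {
            return viaSidecar.send(request, HttpResponse.BodyHandlers.ofString());
        } catch (IOException sidecarDown) {
            // If the sidecar crashes or is unreachable, degrade to a direct
            // connection instead of failing the request outright.
            return direct.send(request, HttpResponse.BodyHandlers.ofString());
        }
    }
}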

Bonus: provider A

■While onboarding a new zone, we noticed 8x latency compared to other zones
●The 3rd party was routing us to Europe
■Tooling that continuously monitors route paths to providers (see the sketch below)
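
The deck doesn't describe that tooling; one simple way to approximate it is to trace the route to the provider on a schedule and alert when it changes (the hostname and interval below are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RouteMonitor {
    public static void main(String[] args) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                // -n skips reverse DNS so runs are fast and comparable.
                Process p = new ProcessBuilder("traceroute", "-n", "providerA.example")
                        .redirectErrorStream(true)
                        .start();
                try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                    // Real tooling would diff this output against the previous
                    // run and feed changes into alerting.
                    r.lines().forEach(System.out::println);
                }
                p.waitFor();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 5, TimeUnit.MINUTES);
    }
}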

Thank you! Let’s connect.
Cristian Velazquez
linkedin.com/in/cdvr1993
GOGCTuner Blog
Ballast blog
Shadower blog
Presto GC tuning