Get Low (Latency) by Benjamin Cane and Tyler Wedin

ScyllaDB · Oct 11, 2024

About This Presentation

Building a real-time, low-latency card payments system is a challenge. Join the Amex Payments Network team to learn about their 100% containerized, globally distributed platform powered by Kubernetes. Discover how they tackled latency with HTTP/2, local affinity, and more. #DevOps #Kubernetes


Slide Content

A ScyllaDB Community
Get Low (Latency)
Benjamin Cane, Distinguished Engineer
Tyler Wedin, VP Global Payments Network SRE

Introductions: Tyler Wedin
▪Vice President of Core Platforms SRE at American Express
▪My primary focus is engineering and instrumenting high availability and resiliency into our most critical customer journeys
▪I spent a considerable amount of my career building high-speed and fault-tolerant infrastructure
▪My favorite greeting is a three-way handshake

Introductions: Benjamin Cane
▪Distinguished Engineer at American Express
▪I work on our core payments platforms
▪Throughout my career I’ve held roles in both infrastructure and software engineering
▪Building fast, scalable, and reliable distributed systems is my passion

Payment Network Modernization
In 2018, American Express started an initiative to rebuild its payment network from the ground up.
We wanted to build a platform that could adapt to the future of payments. We designed the system to be flexible as we continue to enable new products and capabilities.

So, we chose:
▪A microservices-based architecture
▪Modern API-based interactions for internal communications and integrations
▪Containers and Kubernetes

Payment Network Characteristics
▪Scalable
▪Resilient
▪Low-Latency

Understanding the Problem with Microservices

Death by a Thousand Paper Cuts
▪Each service-to-service call increases
–Network overhead
–Latency
–Chances of request failure
▪Cross-region calls make these problems dramatically worse
–~60 milliseconds of latency between New York and Los Angeles*
–~260 milliseconds of latency between Singapore and Los Angeles*
–For example, a transaction requiring three sequential Singapore-to-Los Angeles round trips accrues roughly 3 × 260 ms ≈ 780 ms of pure network time before any processing happens
* As per https://wondernetwork.com/pings

How American Express Optimized Its Payment Network Architecture
ACHIEVING SCALE, LOW LATENCY, AND RESILIENCY

Optimizations: Today’s Focus
▪Local affinity and cross-region routing
▪HTTP/2-based protocols for service-to-service requests
▪Caching that ensures data is locally available before transactions arrive
▪Asynchronous logging, with an emphasis on metrics over logs (see the sketch after this list)
▪Local disks for databases instead of software-defined storage
▪Go as the language of choice for critical routing services
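The asynchronous-logging bullet maps to a simple pattern in Go. Below is a minimal, illustrative sketch, not the Amex implementation: handlers enqueue log entries on a buffered channel and a background goroutine performs the actual write, keeping I/O off the transaction path. All names here are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// asyncLogger decouples request handling from log I/O: callers enqueue
// entries on a buffered channel and a single goroutine writes them out.
type asyncLogger struct {
	entries chan string
	done    chan struct{}
}

func newAsyncLogger(buffer int) *asyncLogger {
	l := &asyncLogger{
		entries: make(chan string, buffer),
		done:    make(chan struct{}),
	}
	go l.drain()
	return l
}

func (l *asyncLogger) drain() {
	for e := range l.entries {
		fmt.Fprintln(os.Stderr, e) // a real system would batch and flush
	}
	close(l.done)
}

// Log never blocks the hot path: if the buffer is full, the entry is
// dropped (a metric counter would record the drop instead).
func (l *asyncLogger) Log(msg string) {
	select {
	case l.entries <- msg:
	default: // buffer full; prefer dropping a log line over adding latency
	}
}

// Close flushes remaining entries and waits for the writer to finish.
func (l *asyncLogger) Close() {
	close(l.entries)
	<-l.done
}

func main() {
	log := newAsyncLogger(1024)
	defer log.Close()
	log.Log("transaction routed locally")
	time.Sleep(10 * time.Millisecond)
}
```

The deliberate drop-on-full behavior reflects the slide's priority: a lost log line is cheaper than added latency, with metrics covering the gap.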

Keeping Transactions Localized by Design
Design Principles:
▪Each cell is independent
▪Cells must leverage local data
▪Transactions are fully processed within the nearest available cell
▪Communication across cells, availability zones, or regions is limited to a custom router (see the sketch below)
–Microservices are prevented from communicating across availability zones, enforced via network controls
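As a rough illustration of the routing principle, here is a minimal sketch of cell-affinity selection: prefer the local cell, fall over to the nearest healthy one. The Cell type and the precomputed distance metric are assumptions for illustration; the actual custom router is not described in this deck.

```go
package main

import "fmt"

// Cell is one independent processing unit; DistanceMS is an assumed
// network-distance metric from the router's own location.
type Cell struct {
	Name       string
	Healthy    bool
	DistanceMS int
}

// nearestAvailable returns the closest healthy cell, so a transaction is
// fully processed in the local cell unless it is unavailable.
func nearestAvailable(cells []Cell) (Cell, bool) {
	var best Cell
	found := false
	for _, c := range cells {
		if !c.Healthy {
			continue
		}
		if !found || c.DistanceMS < best.DistanceMS {
			best, found = c, true
		}
	}
	return best, found
}

func main() {
	cells := []Cell{
		{Name: "us-east-cell-1", Healthy: false, DistanceMS: 0}, // local, but down
		{Name: "us-east-cell-2", Healthy: true, DistanceMS: 2},  // same region
		{Name: "ap-southeast-cell-1", Healthy: true, DistanceMS: 260},
	}
	if c, ok := nearestAvailable(cells); ok {
		fmt.Println("routing to", c.Name) // falls over to us-east-cell-2
	}
}
```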

Pod-to-Pod Communications

Caching and Data Locality
TO LOCALIZE DATA, WE FOLLOW THREE PATTERNS:
▪Preloaded, read-through caching (sketched below)
▪Message-based replication
▪Transaction affinity
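A minimal sketch of the first pattern, preloaded read-through caching, under assumed types: lookups are served from local memory, fall through to a loader (standing in for a regional source of record) only on a miss, and can be warmed ahead of traffic, e.g., from a replication message stream.

```go
package main

import (
	"fmt"
	"sync"
)

// ReadThroughCache serves lookups from local memory and falls back to a
// loader (e.g., the regional database) only on a miss, caching the result.
type ReadThroughCache struct {
	mu     sync.RWMutex
	data   map[string]string
	loader func(key string) (string, error)
}

func New(loader func(string) (string, error)) *ReadThroughCache {
	return &ReadThroughCache{data: make(map[string]string), loader: loader}
}

// Preload warms the cache before transactions arrive (the "preloaded"
// pattern), e.g., fed by message-based replication.
func (c *ReadThroughCache) Preload(entries map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, v := range entries {
		c.data[k] = v
	}
}

// Get is the read-through path: hit local memory first, then the loader.
func (c *ReadThroughCache) Get(key string) (string, error) {
	c.mu.RLock()
	v, ok := c.data[key]
	c.mu.RUnlock()
	if ok {
		return v, nil
	}
	v, err := c.loader(key)
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.data[key] = v
	c.mu.Unlock()
	return v, nil
}

func main() {
	cache := New(func(key string) (string, error) {
		return "loaded:" + key, nil // stand-in for a database read
	})
	cache.Preload(map[string]string{"card:123": "profile-a"})
	v, _ := cache.Get("card:123") // served locally, no remote call
	fmt.Println(v)
}
```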

Optimize Service Request Performance
A significant optimization for our platform was the selection of HTTP/2 and gRPC (which leverages HTTP/2).
▪HTTP/1.1: Synchronous Requests
HTTP/1.1 is by design a synchronous protocol: with each request, the server must respond before the next request is sent on the connection.
▪HTTP/2: Asynchronous Requests with Connection Reuse
HTTP/2 is an asynchronous protocol. Multiple requests can be multiplexed over the same connection.
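To see the difference in practice, the sketch below issues several concurrent requests through one shared client. Against a TLS server that negotiates HTTP/2 (Go's default http.Client does this automatically), they are multiplexed as streams on a single connection rather than serialized as HTTP/1.1 would require. The endpoint is a placeholder, not anything from the deck.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	// One shared client: over HTTP/2, these concurrent requests become
	// streams on a single TCP connection instead of queuing one-at-a-time
	// behind each response as HTTP/1.1 requires.
	client := &http.Client{}
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			resp, err := client.Get("https://example.com/") // placeholder endpoint
			if err != nil {
				fmt.Println("request", n, "failed:", err)
				return
			}
			resp.Body.Close()
			fmt.Println("request", n, "proto:", resp.Proto) // "HTTP/2.0" when h2 is negotiated
		}(i)
	}
	wg.Wait()
}
```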

Service Mesh
While the use of HTTP/2 for service-to-service calls reduced our latency, it also added complexity around load balancing.
Kube-proxy is a layer 4, connection-based load balancer. With HTTP/2, multiple transactions are sent down a single connection, which can overload individual pods while others sit idle.
To properly distribute load across pods, we introduced a service mesh, deploying Envoy sidecars.
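Envoy sidecar configuration is beyond a short sketch, but the underlying requirement, balancing per request (layer 7) rather than per connection (layer 4), can also be illustrated client-side with gRPC-Go's built-in round-robin policy. This is a contrast example, not the approach described on the slide; the service name is hypothetical and assumes a Kubernetes headless Service so DNS resolves to individual pod IPs.

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "payments-svc" is a hypothetical headless Service, so the DNS
	// resolver returns every pod IP rather than a single ClusterIP.
	conn, err := grpc.Dial(
		"dns:///payments-svc.payments.svc.cluster.local:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Round-robin spreads RPCs across all resolved pods, instead of
		// pinning every stream to whichever pod the one TCP connection hit.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	// RPC stubs created from conn now balance per call, not per connection.
}
```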

Summary
▪Get Low
–By focusing on locality, taking the most direct path
▪Get Low
–By limiting dependencies and pushing data ahead of time
▪Get Low
–By using asynchronous communications
▪Get Low
–By making latency and resiliency first-class features of your platform

Outro
Does any of this sound interesting? American Express is hiring! Check out our open positions at americanexpress.com/techcareers
Benjamin Cane – Distinguished Engineer
bencane.com
linkedin.com/in/bencane


Tyler Wedin – Vice President, Core Platforms Site Reliability Engineering
linkedin.com/in/tyler-wedin-47304ba/

Thank you