The Anatomy of Failure - Lessons from running systems to serve millions of people

John Paul Alcala · Sep 21, 2024

About This Presentation

Presented at AWS Community Day Philippines 2024. This 30-minute talk discusses why systems fail at scale and what mitigation strategies can be implemented to improve performance and reliability.

Topics discussed include how to get the resource requirements of your system, preventing resource over...


Slide Content

John Paul Alcala
The Anatomy of Failure
Lessons from running systems to serve millions of people

July 7, 2021 - family.ikea.com.ph went down due to
100,000 concurrent signups during the first minute
(1667-100k TPS)

You just deployed your Laravel App
to production. Then this happens.
The fun part: it's just
you and the QA using
the app

It's lunchtime. You're
about to pay, then this
happens.

JP ALCALA
I am an independent consultant.
I have a lot of exes.
Here are some of them.

Why do systems fail at scale?

Consider this WordPress Installation
Average Traffic: 20 TPS
Average Latency: 50ms
P99 Latency: 200ms
Stack: LAMP
Instance Size: c7i.large
(2 CPU, 4GB RAM)

Then, Marketing Has a Limited-Time Promo
•Traffic spikes to 400 TPS
•App server: CPU ⬆︎, Memory ⬆︎, app crashes
•DB: still responsive, CPU at 30%
•Users experience: 503, 502, request timeouts

To Deal With The Traffic, You Scale Up
•App server CPU utilization
goes down
•App server Load Average
goes up
•DB is unresponsive due to
large number of queries
•Users still experience 503,
502, Request Timeouts

Scaling Systems - A Game of Whack-A-Mole
•Upgrade the DB Server To
Accommodate the Demand
•Requires downtime to
implement ASAP
•Promo ends, and you need
to scale down again to save
on cost
•Users don't want to come
back due to bad experience

We DDoS Ourselves

And don't get me started on
Microservices. The scaling problem becomes
10x

Lessons from dealing with millions of customers
In no particular order
•Resources are finite. Scaling up requires time.
•Hardware failure is real, even in the cloud. Spend for redundancy.
•External systems will screw you. Have a circuit breaker in place.
•Understand the performance characteristics and resource requirements of your
application.
•Implement rate limiting on applications to prevent them from overcommitting to a
high volume of incoming requests.
•Utilize asynchronous processing to improve user experience. Nobody likes to wait.

Mitigation Strategies
General guidelines for your sanity

The obvious ones
•Static assets should be served by a CDN
•Optimize APIs/server-side code for high throughput
•Code changes of course!
•Reduce the amount of logs for max performance
•IO kills performance
•Unnecessary cost

How much does logging affect performance?
Here is a sample Flask application. Incoming traffic is 50 TPS
# P99 latency: 2.88ms
@app.route("/")
def hello_world():
    return "<p>Hello, World!</p>"

# P99 latency: 3.088ms
@app.route("/")
def hello_world():
    app.logger.info('hello')
    return "<p>Hello, World!</p>"

# P99 latency: 4.088ms
@app.route("/")
def hello_world():
    app.logger.info('hello')
    app.logger.info('world')
    return "<p>Hello, World!</p>"

7% impact with one line
35% impact with two lines
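One way to cut the IO cost of logging without deleting log lines is to hand records off to a background thread. A minimal sketch using Python's standard `logging.handlers.QueueHandler`/`QueueListener` (this technique is not from the slides; it is one possible mitigation):

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded queue between the app and the writer thread

# The handler that actually does IO runs on the listener's background thread,
# so the request path only pays for a cheap enqueue.
io_handler = logging.StreamHandler()
listener = logging.handlers.QueueListener(log_queue, io_handler)
listener.start()

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

logger.info("hello")  # returns after a queue put, not a stream write
listener.stop()       # drains and flushes remaining records on shutdown
```

The request handler's latency then depends on the queue, not on the disk or the log shipper.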

The obvious ones
•Make efficient use of the DB
•Usually more costly to scale up
•Usually requires downtime to scale up/down
•Do reports on a separate DB instance

Your infrastructure is only as good
as the applications running inside it

Rightsize the number of worker threads
Don't overcommit.
•Goals:
•Establish application's performance and limits
•Reduce CPU contention under load by rightsizing the number of workers
•Eliminate capacity guesswork by turning application scaling into a
mathematical formula:
•Target TPS = Base TPS @2CPU x Number of Instances
•1000 TPS = 20 TPS @2CPU x 50 Instances of c7i.large
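The formula above reduces capacity planning to arithmetic. A quick sketch, using the slide's numbers (`base_tps` is whatever your own load test established for one instance at 2 CPUs):

```python
import math

def instances_needed(target_tps: float, base_tps: float) -> int:
    """Instances required to serve target_tps, given measured per-instance throughput."""
    return math.ceil(target_tps / base_tps)

# The slide's example: 1000 TPS target, 20 TPS per c7i.large
print(instances_needed(1000, 20))  # 50
```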

Rightsize the number of worker threads
Don't overcommit.
•Start with a baseline of 2 CPUs and 2 workers on your local machine
•Easiest way: Docker Containers
•Determine the max TPS of your application based on the following parameters:
•Number of concurrent requests
•Latencies - Average, P99, Max
•CPU and Memory Utilization under load
•CPU Utilization below 50%?
•Increase number of workers to 4. Load test and compare

Rightsize the number of worker threads
Don't overcommit.
•Determine the instance type to use from its CPU-to-memory ratio
•c series - 1:2
•m series - 1:4
•r series - 1:8
•Instance type determines incremental throughput (don't forget to configure the
workers to match available resources)
•c7i.large - 20 TPS
•c7i.2xlarge - 40 TPS
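The CPU-to-memory ratios above can be encoded directly when sizing workers to match resources; a small sketch (ratios from the slide, instance examples illustrative):

```python
# GiB of memory per vCPU for each EC2 family mentioned on the slide
MEM_PER_VCPU = {"c": 2, "m": 4, "r": 8}

def memory_gib(family: str, vcpus: int) -> int:
    """Memory that comes with a given vCPU count in each family."""
    return MEM_PER_VCPU[family] * vcpus

print(memory_gib("c", 2))  # e.g. c7i.large: 2 vCPU, 4 GiB
print(memory_gib("r", 2))  # e.g. r7i.large: 2 vCPU, 16 GiB
```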

Implement rate limiting and circuit breaker
Traffic control for a better experience
•Goals:
•Maintain established SLAs
•Prevent overcommitting to a high volume of requests
•Respond to unprocessable requests as fast as possible
•A responsive app that returns an error is much better than an app that is
slow to respond, if it responds at all

Implement rate limiting
Traffic control for a better experience
•Rate limiting implementations
•Built-in to the application
•Web server
•Proxy
•Return HTTP Status Code 429 (Too Many Requests)
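A built-in, in-application limiter can be as small as a token bucket. A minimal sketch (not from the talk; in a real Flask app the handler would return 429 whenever `allow()` is False):

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond with HTTP 429 (Too Many Requests)

bucket = TokenBucket(rate=10, capacity=5)
results = [bucket.allow() for _ in range(6)]
print(results)  # first 5 allowed, the 6th rejected (assuming negligible elapsed time)
```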

Implement circuit breaker
Traffic control for a better experience
•Not enough worker threads to process the request
•Upstream dependency is not within SLA
•Response time beyond 3 seconds due to either application or network
•Upstream dependency is returning HTTP Status Code 5xx for a set number
of times
•Implemented within the application or managed via an API Gateway
•Return HTTP Status Code 503 (Service Unavailable)
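The trip conditions above (a run of 5xx responses, an upstream outside its SLA) map naturally onto a small state machine. A sketch of the "implemented within the application" option (names and thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures; while open,
    fail fast (the app should return HTTP 503) until `cooldown` has passed."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: let one probe through; one more failure re-opens
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False  # fast 503, don't touch the struggling upstream

    def record_failure(self):  # call on a 5xx or on a timeout beyond your SLA
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

cb = CircuitBreaker(threshold=3)
for _ in range(3):
    cb.record_failure()
print(cb.allow_request())  # False: circuit is open, return 503 immediately
```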

Plan for High Traffic Events
Users are crazy
•Relying on autoscaling is usually not fast enough to react to a sudden surge of
traffic
•Prepare excess capacity before the event
•Make sure to scale down after the event

Asynchronous processing for better scalability
I want to do other things than wait
•The user should not need to wait for certain operations, such as:
•Sending an email
•Finalizing an order
•Requesting a 1 GB CSV dump
•Allows for higher throughput compared to synchronous processing
•Less resource requirements
•Requires workflow and code changes
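The "don't make the user wait" pattern boils down to enqueue-and-return. A minimal in-process sketch with a worker thread (a production system would use SQS, Celery, or similar; `send_email` is a stand-in for any slow operation):

```python
import queue
import threading

jobs = queue.Queue()
sent = []

def send_email(address: str):
    # Stand-in for slow work (SMTP, order finalization, a 1 GB CSV dump, ...)
    sent.append(address)

def worker():
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut the worker down
            break
        send_email(job)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# The request handler only enqueues and returns immediately,
# e.g. with HTTP 202 Accepted, instead of blocking on the send.
jobs.put("user@example.com")
jobs.join()  # the sketch waits here only so the output is deterministic
print(sent)  # ['user@example.com']
```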

Bonus: Dealing with 502's

https://en.wikipedia.org/wiki/HTTP_502
The 502 Bad Gateway error is an HTTP status code that occurs when a server
acting as a gateway or proxy receives an invalid or faulty response from another
server in the communication chain. This error indicates a problem with the
communication between the involved servers and can result in disruption of
internet services. The 502 Bad Gateway error is considered one of the most
common error codes on the internet and can occur in various scenarios.

What causes 502's?
Insert daddy joke here
•Overloaded server
•Misconfigured infrastructure - firewall restrictions, invalid DNS, or inadequate
routing rules
•Upstream not functioning properly - server failure, software issues, etc.
•Network issues - unstable connections, packet loss, etc.

Fixing the Jeans...I mean 502's
•Look at the Why's (see previous slide)
•Other considerations:
•Increasing timeouts does not solve the problem; it just masks it!
•Application prematurely terminates sending response
•Poor application performance
•An error occurred while transmitting response
•Look into reducing or eliminating buffering

This is just the beginning

Other mitigation strategies not discussed
These are worth looking into
•Idempotency
•Caching
•Database Sharding
•Dealing with hardware failure
•DDoS attack mitigation

John Paul Alcala
Q&A
And feel free to add me on LinkedIn