The Anatomy of Failure - Lessons from running systems to serve millions of people
John Paul Alcala
About This Presentation
Presented at AWS Community Day Philippines 2024. This 30-minute talk discusses why systems fail at scale and what mitigation strategies can be implemented to improve performance and reliability.
Topics discussed include how to determine the resource requirements of your system, preventing resource overcommit through rate limiting and circuit breakers, and asynchronous processing for a better user experience.
Slide Content
John Paul Alcala
The Anatomy of Failure
Lessons from running systems to serve millions of people
July 7, 2021 - family.ikea.com.ph went down due to 100,000 concurrent signups during the first minute (1,667 to 100k TPS)
You just deployed your Laravel App to production. Then this happens.
The fun part: it's just you and the QA using the app.
It's lunchtime. You're about to pay, then this happens.
JP ALCALA
I am an independent consultant.
I have a lot of exes.
Here are some of them.
Why do systems fail at scale?
Consider this WordPress Installation
Average Traffic: 20 TPS
Average Latency: 50ms
P99 Latency: 200ms
Stack: LAMP
Instance Size: c7i.large (2 CPU, 4GB RAM)
Then, Marketing has a Limited Time Promo
•Traffic jumps to 400 TPS
•App server: ⬆︎CPU, ⬆︎Memory, then the app crashes
•DB: still responsive, CPU at 30%
•Users experience: 503, 502, Request Timeouts
To Deal With The Traffic, You Scale Up
•App server CPU utilization goes down
•App server Load Average goes up
•DB is unresponsive due to large number of queries
•Users still experience 503, 502, Request Timeouts
Scaling Systems - A Game of Whack-A-Mole
•Upgrade the DB Server to accommodate the demand
•Requires downtime to implement ASAP
•Promo ends, and you need to scale down again to save on cost
•Users don't want to come back due to bad experience
We DDoS Ourselves
And don't get me started with Microservices. The scaling problem becomes 10x worse.
Lessons from dealing with millions of customers
In no particular order
•Resources are finite. Scaling up requires time.
•Hardware failure is real, even in the cloud. Spend for redundancy.
•External systems will screw you. Have a circuit breaker in place.
•Understand the performance characteristics and resource requirements of your application.
•Implement rate limiting on applications to prevent them from overcommitting to a high volume of incoming requests.
•Utilize asynchronous processing to improve user experience. Nobody likes to wait.
Mitigation Strategies
General guidelines for your sanity
The obvious ones
•Static assets should be served by a CDN
•Optimize APIs/server-side code for high throughput
•Code changes of course!
•Reduce the amount of logs for max performance
•IO kills performance
•Unnecessary cost
How much does logging affect performance?
Here is a sample Flask application. Incoming traffic is 50 TPS
# P99 latency: 2.88ms
@app.route("/")
def hello_world():
    return "<p>Hello, World!</p>"

# P99 latency: 3.088ms
@app.route("/")
def hello_world():
    app.logger.info('hello')
    return "<p>Hello, World!</p>"

# P99 latency: 4.088ms
@app.route("/")
def hello_world():
    app.logger.info('hello')
    app.logger.info('world')
    return "<p>Hello, World!</p>"
7% impact with one line
35% impact with two lines
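A common way to claw back most of that overhead without deleting the log statements (a hedged sketch, not from the slides): raise the logger's level in production so per-request INFO calls are filtered out before any formatting or I/O happens.

import logging
from flask import Flask

app = Flask(__name__)
app.logger.setLevel(logging.WARNING)   # production setting; keep DEBUG/INFO for dev

@app.route("/")
def hello_world():
    app.logger.info("hello")   # skipped at WARNING level, near-zero cost
    return "<p>Hello, World!</p>"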
The obvious ones
•Make efficient use of the DB
•Usually more costly to scale up
•Usually requires downtime to scale up/down
•Do reports on a separate DB instance (see the sketch below)
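One way to keep report queries off the primary (a hedged sketch, assuming SQLAlchemy and a read replica; hostnames and queries are illustrative):

from sqlalchemy import create_engine, text

# Primary handles transactional traffic; the replica absorbs heavy reporting.
primary = create_engine("postgresql://app@db-primary.internal/shop")
replica = create_engine("postgresql://app@db-replica.internal/shop")

def place_order(order_id: int) -> None:
    with primary.begin() as conn:
        conn.execute(text("UPDATE orders SET status = 'paid' WHERE id = :id"),
                     {"id": order_id})

def daily_sales_report():
    # Long-running aggregation stays off the primary instance.
    with replica.connect() as conn:
        return conn.execute(text(
            "SELECT date_trunc('day', created_at) AS day, sum(total) "
            "FROM orders GROUP BY 1 ORDER BY 1"
        )).fetchall()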
Your infrastructure is only as good as the applications running inside it
Rightsize the number of worker threads
Don't overcommit.
•Goals:
•Establish application's performance and limits
•Reduce CPU contention under load by rightsizing the number of workers
•Eliminate capacity guesswork by turning application scaling into a mathematical formula (see the sketch below):
•Target TPS = Base TPS @2CPU x Number of Instances
•1000 TPS = 20 TPS @2CPU x 50 Instances of c7i.large
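The arithmetic behind that formula, as a quick sketch (function name is illustrative):

import math

def instances_needed(target_tps: float, base_tps_per_instance: float) -> int:
    """Capacity formula from the slide: Target TPS = Base TPS x Number of Instances."""
    return math.ceil(target_tps / base_tps_per_instance)

# 1000 TPS at a 20 TPS-per-instance baseline -> 50 instances of c7i.large
print(instances_needed(1000, 20))  # 50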
Rightsize the number of worker threads
Don't overcommit.
•Start with a baseline of 2 CPUs and 2 workers on your local machine (baseline config sketched after this list)
•Easiest way: Docker Containers
•Determine the max TPS of your application based on the following parameters:
•Number of concurrent requests
•Latencies - Average, P99, Max
•CPU and Memory Utilization under load
•CPU Utilization below 50%?
•Increase number of workers to 4. Load test and compare
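To make the baseline reproducible, a minimal gunicorn.conf.py might look like this (a hedged sketch, assuming the app is served by gunicorn; the file name and values are just the 2-CPU / 2-worker starting point):

# gunicorn.conf.py - 2-CPU / 2-worker baseline for load testing
# Run the container with a matching CPU limit (e.g. Docker's --cpus=2)
# so the worker count and the available cores line up.
bind = "0.0.0.0:8000"
workers = 2     # one worker per CPU to start; raise to 4 only if CPU stays below ~50%
threads = 1     # keep the model simple while establishing the baseline
timeout = 30    # fail fast instead of letting requests queue forever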
Rightsize the number of worker threads
Don't overcommit.
•Determine the instance type to use by determining the CPU to Memory Ratio
•c series - 1:2
•m series - 1:4
•r series - 1:8
•Instance type determines incremental throughput (don't forget to configure the workers to match available resources)
•c7i.large - 20 TPS
•c7i.2xlarge - 40 TPS
Implement rate limiting and circuit breaker
Traffic control for a better experience
•Goals:
•Maintain established SLA's
•Prevent overcommitting to a high volume of requests
•Respond to unprocessable requests as fast as possible
•A responsive app that returns an error is much better than an app that is slow to respond, if it responds at all
Implement rate limiting
Traffic control for a better experience
•Rate limiting implementations
•Built-in to the application
•Web server
•Proxy
•Return HTTP Status Code 429 (Too Many Requests) - see the sketch below
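A minimal application-level sketch of the idea (hedged: a fixed-window, in-memory counter keyed by client IP, fine for a single process; a real deployment would typically back this with a shared store like Redis or enforce it at the web server or proxy):

import time
from collections import defaultdict
from flask import Flask, jsonify, request

app = Flask(__name__)

LIMIT = 50          # requests per window, per client IP (illustrative)
WINDOW = 1.0        # window size in seconds
_counters = defaultdict(lambda: [0.0, 0])   # ip -> [window_start, count]

@app.before_request
def rate_limit():
    now = time.time()
    window_start, count = _counters[request.remote_addr]
    if now - window_start >= WINDOW:
        _counters[request.remote_addr] = [now, 1]
        return None
    if count >= LIMIT:
        # Shed the request immediately instead of overcommitting workers.
        return jsonify(error="rate limit exceeded"), 429
    _counters[request.remote_addr][1] = count + 1
    return None

@app.route("/")
def hello_world():
    return "<p>Hello, World!</p>"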
Implement circuit breaker
Traffic control for a better experience
•Not enough worker threads to process the request
•Upstream dependency is not within SLA
•Response time beyond 3 seconds due to either application or network
•Upstream dependency is returning HTTP Status Code 5xx for a set number of times
•Implemented within the application or managed via an API Gateway (application-level sketch below)
•Return HTTP Status Code 503 (Service Unavailable)
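A bare-bones in-application sketch (hedged: the upstream URL, thresholds, and endpoint are illustrative; libraries such as pybreaker or an API Gateway policy handle more edge cases):

import time
import requests
from flask import Flask, jsonify

app = Flask(__name__)

FAILURE_THRESHOLD = 5     # consecutive upstream failures before the breaker opens
COOL_DOWN = 30            # seconds to stay open before trying the upstream again
_failures = 0
_opened_at = 0.0

@app.route("/orders")
def orders():
    global _failures, _opened_at
    # Open state: fail fast with a 503 instead of tying up a worker
    # on an upstream that is already known to be unhealthy.
    if _failures >= FAILURE_THRESHOLD and time.time() - _opened_at < COOL_DOWN:
        return jsonify(error="upstream unavailable"), 503
    try:
        resp = requests.get("http://upstream.internal/orders", timeout=3)
        resp.raise_for_status()
    except requests.RequestException:
        _failures += 1
        if _failures >= FAILURE_THRESHOLD:
            _opened_at = time.time()
        return jsonify(error="upstream unavailable"), 503
    _failures = 0          # a successful call closes the breaker
    return jsonify(resp.json())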
Plan for High Traffic Events
Users are crazy
•Relying on autoscaling is usually not fast enough to react to a sudden surge of traffic
•Prepare excess capacity before the event (see the sketch below)
•Make sure to scale down after the event
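For example, the pre-event bump can be done with a couple of scheduled calls (a hedged sketch using boto3; the Auto Scaling group name and capacity numbers are illustrative):

import boto3

autoscaling = boto3.client("autoscaling")

def prescale(group_name: str, capacity: int) -> None:
    # Raise desired capacity before the promo starts, instead of waiting
    # for autoscaling alarms to catch up with the surge.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=capacity,
        HonorCooldown=False,
    )

prescale("app-asg", 50)   # before the event
# ...and call it again with the normal capacity after the event ends.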
Asynchronous processing for better scalability
I want to do other things than wait
•The user should not need to wait for certain operations, such as:
•Sending an email
•Finalizing an order
•Requesting a 1 GB CSV dump
•Allows for higher throughput compared to synchronous processing
•Lower resource requirements
•Requires workflow and code changes (see the sketch below)
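A minimal sketch of the hand-off (hedged: a single in-process worker thread and an illustrative payload; production setups usually push jobs to SQS, a message broker, or a task runner such as Celery):

import queue
import threading
import time
from flask import Flask, jsonify

app = Flask(__name__)
jobs = queue.Queue()

def worker():
    # Drains the queue in the background so web workers stay free.
    while True:
        email = jobs.get()
        time.sleep(2)                  # stand-in for the slow operation (email, CSV dump, ...)
        app.logger.info("report sent to %s", email)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

@app.route("/reports", methods=["POST"])
def request_report():
    # Accept the work and respond immediately; the user does not wait.
    jobs.put("user@example.com")       # illustrative payload
    return jsonify(status="queued"), 202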
Bonus: Dealing with 502's
https://en.wikipedia.org/wiki/HTTP_502
The 502 Bad Gateway error is an HTTP status code that occurs when a server
acting as a gateway or proxy receives an invalid or faulty response from another
server in the communication chain. This error indicates a problem with the
communication between the involved servers and can result in disruption of
internet services. The 502 Bad Gateway error is considered one of the most
common error codes on the internet and can occur in various scenarios.
What causes 502's?
Insert daddy joke here
•Overloaded server
•Misconfigured infrastructure - firewall restrictions, invalid DNS, or inadequate routing rules
•Upstream not functioning properly - server failure, software issues, etc.
•Network issues - unstable connections, packet loss, etc.
Fixing the Jeans...I mean 502's
•Look at the Why's (see previous slide)
•Other considerations:
•Increasing timeouts does not solve the problem. It just masks it!
•Application prematurely terminates sending response
•Poor application performance
•An error occurred while transmitting response
•Look into reducing or eliminating buffering
This is just the beginning
Other mitigation strategies not discussed
These are worth looking into
•Idempotency
•Caching
•Database Sharding
•Dealing with hardware failure
•DDoS mitigation due to attacks
John Paul Alcala
Q&A
And feel free to add me on LinkedIn