The Anatomy of Failure - Lessons from running systems to serve millions of people

John Paul Alcala · Sep 21, 2024

About This Presentation

Presented at AWS Community Day Philippines 2024. This 30-minute talk discusses why systems fail at scale and what mitigation strategies can be implemented to improve performance and reliability.

Topics discussed include how to get the resource requirements of your system, preventing resource over...


Slide Content

John Paul Alcala
The Anatomy of Failure
Lessons from running systems to serve millions of people

July 7, 2021 - family.ikea.com.ph went down due to
100,000 concurrent signups during the first minute
(1667-100k TPS)

You just deployed your Laravel App
to production. Then this happens.
The fun part: it's just
you and the QA using
the app

It's lunchtime. You're
about to pay, then this
happens.

JP ALCALA
I am an independent consultant.
I have a lot of exes.
Here are some of them.

Why do systems fail at scale?

Consider this WordPress Installation
Average Traffic: 20 TPS
Average Latency: 50ms
P99 Latency: 200ms
Stack: LAMP
Instance Size: c7i.large
(2 CPU, 4GB RAM)

Then, Marketing Has a Limited-Time Promo
•Traffic spikes to 400 TPS
•App server: CPU ⬆︎, Memory ⬆︎, app crashes
•DB: still responsive, CPU at 30%
•Users experience: 503, 502, request timeouts

To Deal With The Traffic, You Scale Up
•App server CPU utilization
goes down
•App server Load Average
goes up
•DB is unresponsive due to
large number of queries
•Users still experience 503,
502, Request Timeouts

Scaling Systems - A Game of Whack-A-Mole
•Upgrade the DB Server To
Accommodate the Demand
•Requires downtime to
implement ASAP
•Promo ends, and you need
to scale down again to save
on cost
•Users don't want to come
back due to bad experience

We DDoS Ourselves

And don't get me started on
Microservices. The scaling problem becomes
10x

Lessons from dealing with millions of customers
In no particular order
•Resources are finite. Scaling up requires time.
•Hardware failure is real, even in the cloud. Spend for redundancy.
•External systems will screw you. Have a circuit breaker in place.
•Understand the performance characteristics and resource requirements of your
application.
•Implement rate limiting on applications to prevent them from overcommitting to a
high volume of incoming requests.
•Utilize asynchronous processing to improve user experience. Nobody likes to wait.

Mitigation Strategies
General guidelines for your sanity

The obvious ones
•Static assets should be served by a CDN
•Optimize APIs/server-side code for high throughput
•Code changes of course!
•Reduce the amount of logs for max performance
•IO kills performance
•Unnecessary cost

How much does logging affect performance?
Here is a sample Flask application. Incoming traffic is 50 TPS
# P99 latency: 2.88ms
@app.route("/")
def hello_world():
    return "<p>Hello, World!</p>"

# P99 latency: 3.088ms
@app.route("/")
def hello_world():
    app.logger.info('hello')
    return "<p>Hello, World!</p>"

# P99 latency: 4.088ms
@app.route("/")
def hello_world():
    app.logger.info('hello')
    app.logger.info('world')
    return "<p>Hello, World!</p>"

7% impact with one line
35% impact with two lines
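One way to cut the IO cost of logging without deleting log lines is to hand records off to a background thread. A minimal sketch using Python's standard `logging.handlers.QueueHandler`/`QueueListener` (this technique is not from the slides; it is one possible mitigation):

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded queue between the app and the writer thread

# The handler that actually does IO runs on the listener's background thread,
# so the request path only pays for a cheap enqueue.
io_handler = logging.StreamHandler()
listener = logging.handlers.QueueListener(log_queue, io_handler)
listener.start()

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

logger.info("hello")  # returns after a queue put, not a stream write
listener.stop()       # drains and flushes remaining records on shutdown
```

The request handler's latency then depends on the queue, not on the disk or the log shipper.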

The obvious ones
•Make efficient use of the DB
•Usually more costly to scale up
•Usually requires downtime to scale up/down
•Do reports on a separate DB instance

Your infrastructure is only as good
as the applications running inside it

Rightsize the number of worker threads
Don't overcommit.
•Goals:
•Establish application's performance and limits
•Reduce CPU contention under load by rightsizing the number of workers
•Eliminate capacity guesswork by turning application scaling into a
mathematical formula:
•Target TPS = Base TPS @2CPU x Number of Instances
•1000 TPS = 20 TPS @2CPU x 50 Instances of c7i.large
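The formula above reduces capacity planning to arithmetic. A quick sketch, using the slide's numbers (`base_tps` is whatever your own load test established for one instance at 2 CPUs):

```python
import math

def instances_needed(target_tps: float, base_tps: float) -> int:
    """Instances required to serve target_tps, given measured per-instance throughput."""
    return math.ceil(target_tps / base_tps)

# The slide's example: 1000 TPS target, 20 TPS per c7i.large
print(instances_needed(1000, 20))  # 50
```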

Rightsize the number of worker threads
Don't overcommit.
•Start with a baseline of 2 CPUs and 2 workers on your local machine
•Easiest way: Docker Containers
•Determine the max TPS of your application based on the following parameters:
•Number of concurrent requests
•Latencies - Average, P99, Max
•CPU and Memory Utilization under load
•CPU Utilization below 50%?
•Increase number of workers to 4. Load test and compare

Rightsize the number of worker threads
Don't overcommit.
•Determine the instance type to use from its CPU-to-memory ratio
•c series - 1:2
•m series - 1:4
•r series - 1:8
•Instance type determines incremental throughput (don't forget to configure the
workers to match available resources)
•c7i.large - 20 TPS
•c7i.2xlarge - 40 TPS
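The CPU-to-memory ratios above can be encoded directly when sizing workers to match resources; a small sketch (ratios from the slide, instance examples illustrative):

```python
# GiB of memory per vCPU for each EC2 family mentioned on the slide
MEM_PER_VCPU = {"c": 2, "m": 4, "r": 8}

def memory_gib(family: str, vcpus: int) -> int:
    """Memory that comes with a given vCPU count in each family."""
    return MEM_PER_VCPU[family] * vcpus

print(memory_gib("c", 2))  # e.g. c7i.large: 2 vCPU, 4 GiB
print(memory_gib("r", 2))  # e.g. r7i.large: 2 vCPU, 16 GiB
```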

Implement rate limiting and circuit breaker
Traffic control for a better experience
•Goals:
•Maintain established SLAs
•Prevent overcommitting to a high volume of requests
•Respond to unprocessable requests as fast as possible
•A responsive app that returns an error is much better than an app that is
slow to respond, if it responds at all

Implement rate limiting
Traffic control for a better experience
•Rate limiting implementations
•Built-in to the application
•Web server
•Proxy
•Return HTTP Status Code 429 (Too Many Requests)
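A built-in, in-application limiter can be as small as a token bucket. A minimal sketch (not from the talk; in a real Flask app the handler would return 429 whenever `allow()` is False):

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond with HTTP 429 (Too Many Requests)

bucket = TokenBucket(rate=10, capacity=5)
results = [bucket.allow() for _ in range(6)]
print(results)  # first 5 allowed, the 6th rejected (assuming negligible elapsed time)
```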

Implement circuit breaker
Traffic control for a better experience
•Not enough worker threads to process the request
•Upstream dependency is not within SLA
•Response time beyond 3 seconds due to either application or network
•Upstream dependency is returning HTTP Status Code 5xx for a set number
of times
•Implemented within the application or managed via an API Gateway
•Return HTTP Status Code 503 (Service Unavailable)
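The trip conditions above (a run of 5xx responses, an upstream outside its SLA) map naturally onto a small state machine. A sketch of the "implemented within the application" option (names and thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures; while open,
    fail fast (the app should return HTTP 503) until `cooldown` has passed."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: let one probe through; one more failure re-opens
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False  # fast 503, don't touch the struggling upstream

    def record_failure(self):  # call on a 5xx or on a timeout beyond your SLA
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

cb = CircuitBreaker(threshold=3)
for _ in range(3):
    cb.record_failure()
print(cb.allow_request())  # False: circuit is open, return 503 immediately
```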

Plan for High Traffic Events
Users are crazy
•Relying on autoscaling is usually not fast enough to react to a sudden surge of
traffic
•Prepare excess capacity before the event
•Make sure to scale down after the event

Asynchronous processing for better scalability
I want to do other things than wait
•The user should not need to wait for certain operations, such as:
•Sending an email
•Finalizing an order
•Requesting a 1 GB CSV dump
•Allows for higher throughput compared to synchronous processing
•Less resource requirements
•Requires workflow and code changes
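The "don't make the user wait" pattern boils down to enqueue-and-return. A minimal in-process sketch with a worker thread (a production system would use SQS, Celery, or similar; `send_email` is a stand-in for any slow operation):

```python
import queue
import threading

jobs = queue.Queue()
sent = []

def send_email(address: str):
    # Stand-in for slow work (SMTP, order finalization, a 1 GB CSV dump, ...)
    sent.append(address)

def worker():
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut the worker down
            break
        send_email(job)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# The request handler only enqueues and returns immediately,
# e.g. with HTTP 202 Accepted, instead of blocking on the send.
jobs.put("user@example.com")
jobs.join()  # the sketch waits here only so the output is deterministic
print(sent)  # ['user@example.com']
```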

Bonus: Dealing with 502's

https://en.wikipedia.org/wiki/HTTP_502
The 502 Bad Gateway error is an HTTP status code that occurs when a server
acting as a gateway or proxy receives an invalid or faulty response from another
server in the communication chain. This error indicates a problem with the
communication between the involved servers and can result in disruption of
internet services. The 502 Bad Gateway error is considered one of the most
common error codes on the internet and can occur in various scenarios.

What causes 502's?
Insert daddy joke here
•Overloaded server
•Misconfigured infrastructure - firewall restrictions, invalid DNS, or inadequate
routing rules
•Upstream not functioning properly - server failure, software issues, etc.
•Network issues - unstable connections, packet loss, etc.

Fixing the Jeans...I mean 502's
•Look at the Why's (see previous slide)
•Other considerations:
•Increasing timeouts does not solve the problem; it just masks it!
•Application prematurely terminates sending response
•Poor application performance
•An error occurred while transmitting response
•Look into reducing or eliminating buffering

This is just the beginning

Other mitigation strategies not discussed
These are worth looking into
•Idempotency
•Caching
•Database Sharding
•Dealing with hardware failure
•DDoS attack mitigation

John Paul Alcala
Q&A
And feel free to add me on LinkedIn