Weathering the storm: how to manage failure in the cloud

setoide 7 views 50 slides Sep 02, 2024

About This Presentation

Cloud computing has been a game-changer for businesses striving to enhance digital experiences. However, managing services at scale with modern cloud tech isn't always a cakewalk. Failure is inevitable, and the way we prepare for it, react to it and learn from it will define the success or failure of our jour...


Slide Content

How to navigate failure in the cloud
Trent Hornibrook
Javier Turegano

Hi! My name is Trent.

Hi! My name is Javier.

"Everything fails all the time"
Werner Vogels

At the beginning...
The year is 2011 and we are learning how to use the Cloud...

The zombie apocalypse
That Time I Accidentally Terminated 600 Instances

Property photos

Myfun.com
New greenfield project using the AWS Beijing region.

Introduce failure to your design
HTTP 500
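
One way to introduce failure into a design is to inject it deliberately and check that callers degrade gracefully. The sketch below is a hypothetical illustration, not code from the talk: `FaultInjector`, `healthy_handler` and `call_with_fallback` are made-up names, and the fallback value stands in for whatever degraded response a real service would serve.

```python
import random


class FaultInjector:
    """Wraps a request handler and fails a configurable fraction of calls.

    Hypothetical helper for illustration only.
    """

    def __init__(self, handler, failure_rate, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.handler = handler
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def __call__(self, request):
        # random() is in [0, 1), so a rate of 0.0 never fails
        # and a rate of 1.0 always fails.
        if self.rng.random() < self.failure_rate:
            return 500, "injected failure"
        return self.handler(request)


def healthy_handler(request):
    """Stand-in for a real dependency that always succeeds."""
    return 200, f"ok: {request}"


def call_with_fallback(service, request, fallback="cached response"):
    """Return the service body, or a degraded fallback on a 5xx status."""
    status, body = service(request)
    if status >= 500:
        return fallback
    return body
```

Running the caller against `FaultInjector(healthy_handler, failure_rate=1.0)` in a test shows whether an HTTP 500 from a dependency reaches the user or is absorbed by the fallback.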

AWS Sydney outage 2016
How REA (Mostly) Survived The Stormy Apocalypse

Multi region Active / Active

Key lessons
Cattle and not pets
Design to be recoverable
Shift left thinking
Testing for failure

New Year - 2021

View of core services
Diagram labels: Provision Service, Network, Compute, Monitoring, Logs, Metrics, Traces, Dashboards, Alerting
Status: OK / OK / OK

Everything is normal (APAC -> EMEA)
Status: OK / OK / OK

Packet loss commences - First alerts
Status: OK / OK / OK

We lose our visibility
Network: packet loss
Status: OK / OK / OK

Minor downscale (CPU)
Network: packet loss
Status: OK / OK / OK

Massive upscale (Apache thread utilization)
Network: packet loss
Status: SLOW / SLOW / SLOW

Provision Service is saturated (limits)
Network: packet loss
Status: DOWN / DOWN / DOWN

We fix provision service
Network: packet loss
Status: SLOW - RECOVERING / SLOW - RECOVERING / SLOW - RECOVERING

Enough web capacity to serve Slack
Network: packet loss
Status: OK / OK / OK

AWS scales transit gateway
Status: OK / OK / OK

Everything back to normal
Network: AWS scales transit gateway
Blog post
Status: OK / OK / OK

All Hands on Deck
Blog post

Slack outage
January 4th

Learning from incidents
Incident reviews
Understand the timeline
Identify contributing factors
Good, bad and how we got lucky
Follow up items
Share lessons learned and key patterns with the rest of the org

Learning from incidents
Long term fixes that were put in place
Adjusted scaling policies
Fully re-designed provision service
Worked with AWS to improve Transit GW auto-scaling
Expedited our migration to the new network architecture

Learn more from other Slack incidents
Slack’s Incident on 2-22-22
A Terrible, Horrible, No-Good, Very Bad Day at Slack

Chaos Engineering Tools

Other lessons learned

Monzo’s Kube Outage (2017)
Presentation

Canva’s Recommender system (2021)
Mayur & Thien’s
Blog post

amazon.com Black Friday outage
Article

Congestion collapse - when retries are bad
re:Invent 2023 - Surviving overloads (NET402)
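
Retries turn bad when every client retries a struggling backend at once and multiplies its load. A minimal sketch of two common mitigations follows, assuming a client-side retry budget and capped exponential backoff with full jitter; `RetryBudget`, `backoff_delay` and `call_with_retries` are hypothetical names, and the ratio and delay values are illustrative rather than taken from the talk.

```python
import random
import time


class RetryBudget:
    """Allows retries only as a fraction of normal request volume."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio          # tokens earned per first attempt
        self.max_tokens = max_tokens
        self.tokens = 0.0

    def record_request(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        """Spend one token for a retry; refuse when the budget is empty."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, budget, max_attempts=3, sleep=time.sleep):
    """Call operation, retrying only while attempts and budget remain."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            is_last = attempt == max_attempts - 1
            if is_last or not budget.try_spend():
                raise  # give up instead of piling on more load
            sleep(backoff_delay(attempt))
```

With a budget ratio of 0.1, a backend that is fully down sees roughly 10% extra traffic from retries in steady state instead of up to `max_attempts` times the normal load.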

Multi-Region is a hard problem for writes
AWS Multi-Region Fundamentals

Global services, regional services and zonal services
AWS Fault Isolation Boundaries

Maybe we just need to know our fault boundaries?
AWS Fault Isolation Boundaries

Original Multi-AZ architecture at Slack
Failures within one AZ could affect the whole system.
Gray failures paper

Cellular Architecture at Slack
AZs as cells, and cells may be drained
Blog post
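
The idea of treating AZs as cells that can be drained can be sketched as weighted routing where a drained cell gets weight zero. This is an illustrative toy, not Slack's implementation: `CellRouter` and the cell names used with it are assumptions made up for the example.

```python
import random


class CellRouter:
    """Routes each request to one cell; drained cells receive no traffic."""

    def __init__(self, cells, rng=None):
        self.weights = {cell: 1.0 for cell in cells}
        self.rng = rng or random.Random()

    def _check(self, cell):
        if cell not in self.weights:
            raise KeyError(f"unknown cell: {cell}")

    def drain(self, cell):
        """Stop sending new requests to a cell suspected of failing."""
        self._check(cell)
        self.weights[cell] = 0.0

    def restore(self, cell):
        """Return a recovered cell to the rotation."""
        self._check(cell)
        self.weights[cell] = 1.0

    def pick(self):
        """Choose a cell at random among those that are not drained."""
        live = [(c, w) for c, w in self.weights.items() if w > 0]
        if not live:
            raise RuntimeError("all cells are drained")
        cells, weights = zip(*live)
        return self.rng.choices(cells, weights=weights, k=1)[0]
```

For example, after `router = CellRouter(["az-a", "az-b", "az-c"])` and `router.drain("az-b")`, every `router.pick()` returns one of the two remaining cells, so a gray failure inside one AZ stops affecting new requests without waiting for a precise diagnosis.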

Summary
Some takeaways to weather the storm
Things break all the time
Some patterns to build more resilient systems in the Cloud
A learning culture is key
Popular ways to mitigate
Practice your incident management processes
Share your incident lessons with the Community

Ice breaker
Hey, what’s a memorable incident you’ve been part of or read about?

Thank you for listening!
Feel free to DM us your questions any time:
@setoide
@mysqldbahelp