Weathering the storm, how to manage failure in the cloud

setoide 7 views 50 slides Sep 02, 2024

About This Presentation

Cloud computing has been a game-changer for businesses striving to enhance digital experiences. However, managing services at scale with modern cloud tech isn't always a cakewalk. Failure is inevitable and the way we prepare, react and learn from it will define the success or failure of our jour...

Slide Content

How to navigate failure in the cloud
Trent Hornibrook
Javier Turegano

Hi! My name is Trent.

Hi! My name is Javier.

Everything fails all the time
Werner Vogels

At the beginning...
The year is 2011 and we are learning how to use the Cloud...

The zombie apocalypse
That Time I Accidentally Terminated 600 Instances

Property photos

Myfun.com
New greenfield project using the
AWS Beijing region.

Introduce failure to your design
HTTP 500
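
One way to read "introduce failure to your design" is to inject faults on purpose, so that every caller has to handle them. The following is a minimal sketch under assumptions: `with_fault_injection`, `fetch_photos` and the `(status, body)` response shape are hypothetical names for illustration and do not come from the talk.

```python
import random
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


def with_fault_injection(
    handler: Callable[[], Response],
    failure_rate: float,
    rng: Optional[random.Random] = None,
) -> Callable[[], Response]:
    """Wrap a handler so that a fraction of calls return HTTP 500.

    Hypothetical helper; failure_rate is a probability in [0, 1].
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    def wrapped() -> Response:
        # Fail before doing any real work, as a broken dependency would.
        if rng.random() < failure_rate:
            return 500, "injected failure"
        return handler()

    return wrapped


def fetch_photos() -> Response:
    # Stand-in for a real downstream call (assumed for this example).
    return 200, "photos"
```

Wrapping a dependency with a small `failure_rate` in a test environment makes missing error handling visible early, before a real outage exposes it.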

AWS Sydney outage 2016
How REA (Mostly) Survived The Stormy Apocalypse

Multi region Active / Active

Key lessons
Cattle, not pets
Design to be recoverable
Shift left thinking
Testing for failure

New Year - 2021

View of core services
Diagram: Provision Service, Network, Compute, Monitoring (Logs, Metrics, Traces, Dashboards, Alerting)
Status: OK, OK, OK

Everything is normal (APAC -> EMEA)
Status: OK, OK, OK

Packet loss commences - First alerts
Network: PACKET LOSS
Status: OK, OK, OK

We lose our visibility
Network: PACKET LOSS
Status: OK, OK, OK

Minor downscale (CPU)
Network: PACKET LOSS
Status: OK, OK, OK

Massive upscale (Apache thread utilization)
Network: PACKET LOSS
Status: SLOW, SLOW, SLOW

Provision Service is saturated (limits)
Network: PACKET LOSS
Status: DOWN, DOWN, DOWN

We fix provision service
Network: PACKET LOSS
Status: SLOW - RECOVERING, SLOW - RECOVERING, SLOW - RECOVERING

Enough web capacity to serve Slack
Network: PACKET LOSS
Status: OK, OK, OK

AWS scales transit gateway
Network: AWS SCALES TRANSIT GW
Status: OK, OK, OK

Everything back to normal
Network: AWS SCALES TRANSIT GW
Blog post
Status: OK, OK, OK

All Hands on Deck
Blog post

Slack outage
January 4th

Learning from incidents
Incident reviews
Understand the timeline
Identify contributing factors
Good, bad and how we got lucky
Follow up items
Share lessons learned and key patterns with the rest of the org

Learning from incidents
Long term fixes that were put in place
Adjusted scaling policies
Fully re-designed provision service
Worked with AWS to improve Transit GW auto-scaling
Expedited our migration to the new network architecture

Learn more from other Slack incidents
Slack’s Incident on 2-22-22
A Terrible, Horrible, No-Good,
Very Bad Day at Slack

Chaos Engineering Tools

Other lessons learned

Monzo’s Kube Outage (2017)
Presentation

Canva’s Recommender system (2021)
Mayur & Thien’s
Blog post

amazon.com Black Friday outage
Article

Congestion collapse - when retries are bad
re:Invent 2023 - Surviving overloads (NET402)
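
Retries turn harmful when every client retries at once against a service that is already overloaded. A common mitigation is to cap the number of attempts, back off exponentially with jitter, and limit retries with a shared budget. The sketch below illustrates that idea under assumptions: `RetryBudget`, `call_with_retries` and all default values are hypothetical and are not taken from the talk or from the re:Invent session.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryBudget:
    """Allow retries only while tokens remain; successes slowly refill them."""

    def __init__(self, max_tokens: float = 10.0, refill_per_success: float = 0.1):
        self.max_tokens = max_tokens
        self.refill_per_success = refill_per_success
        self.tokens = max_tokens

    def record_success(self) -> None:
        self.tokens = min(self.max_tokens, self.tokens + self.refill_per_success)

    def try_spend(self) -> bool:
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_retries(
    operation: Callable[[], T],
    budget: RetryBudget,
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call operation, retrying failures with capped, jittered backoff."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            result = operation()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            # Give up if attempts are exhausted or the shared budget is empty.
            if last_attempt or not budget.try_spend():
                raise
            # Full jitter: sleep a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng.uniform(0.0, cap))
        else:
            budget.record_success()
            return result
    raise RuntimeError("max_attempts must be at least 1")
```

Sharing one `RetryBudget` across all calls to the same dependency means that, during a sustained outage, the extra load from retries shrinks toward zero instead of multiplying the original traffic.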

Multi-Region is a hard problem for writes
AWS Multi-Region Fundamentals

Global services, regional services and zonal services
AWS Fault Isolation Boundaries

Maybe we just need to know our fault boundaries?
AWS Fault Isolation Boundaries

Original Multi-AZ architecture at Slack
Failures within one AZ could affect the
whole system.
Gray failures paper

Cellular Architecture at Slack
AZs as cells, and cells may be drained
Blog post
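
The idea of treating AZs as cells that can be drained can be sketched as a router that stops sending new requests to a drained cell. This is an illustrative sketch only, not Slack's implementation: `CellRouter`, its methods and the cell names are hypothetical.

```python
from typing import List, Set


class CellRouter:
    """Route requests round-robin across cells, skipping drained ones."""

    def __init__(self, cells: List[str]):
        if not cells:
            raise ValueError("at least one cell is required")
        self._cells = list(cells)
        self._drained: Set[str] = set()
        self._next = 0

    def _check(self, cell: str) -> None:
        if cell not in self._cells:
            raise KeyError(f"unknown cell: {cell}")

    def drain(self, cell: str) -> None:
        # Stop routing new requests to this cell (for example, a failing AZ).
        self._check(cell)
        self._drained.add(cell)

    def undrain(self, cell: str) -> None:
        self._check(cell)
        self._drained.discard(cell)

    def active_cells(self) -> List[str]:
        return [c for c in self._cells if c not in self._drained]

    def route(self) -> str:
        active = self.active_cells()
        if not active:
            raise RuntimeError("all cells are drained")
        cell = active[self._next % len(active)]
        self._next += 1
        return cell
```

Because each request is served entirely inside the chosen cell, draining one cell removes a misbehaving AZ from the request path without needing to diagnose the failure first.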

Summary
Some takeaways to weather the storm
Things break all the time
Some patterns to build more resilient systems in the Cloud
A learning culture is key
Popular ways to mitigate
Practice your incident management processes
Share your incident lessons with the Community

Ice breaker
Hey, what’s a
memorable incident
you’ve been part of or
read about?

Thank you
for listening!
Feel free to DM us your
questions any time:
@setoide
@mysqldbahelp
