Weathering the storm: how to manage failure in the cloud
setoide
50 slides
Sep 02, 2024
About This Presentation
Cloud computing has been a game-changer for businesses striving to enhance digital experiences. However, managing services at scale with modern cloud tech isn't always a cakewalk. Failure is inevitable and the way we prepare, react and learn from it will define the success or failure of our journey.
Join Trent and Javier in this session as they share some stories of different failure modes they’ve encountered during their time at REA Group, Slack, Catch and other tech businesses. These aren't just stories – they're valuable lessons. Get insights, practical tips, and discover strategies to handle the bumps in the road when dealing with cloud tech.
Slide Content
How to navigate failure in the cloud
Trent Hornibrook
Javier Turegano
Hi! My name is Trent.
Hi! My name is Javier.
Everything
fails
all the time
Werner Vogels
At the beginning...
The year is 2011 and we are learning
how to use the Cloud...
The zombie apocalypse
That Time I Accidentally Terminated 600 Instances
Property photos
Myfun.com
New greenfield project using the
AWS Beijing region.
Introduce failure to your design
HTTP 500
AWS Sydney outage 2016
How REA (Mostly) Survived The Stormy Apocalypse
Multi region Active / Active
Cattle and not pets
Design to be recoverable
Shift left thinking
Testing for failure
Key lessons
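One of the lessons above, testing for failure, can be sketched in a few lines. This is a minimal illustration rather than anything shown in the talk: `with_fault_injection`, `fetch_photo`, and `fetch_photo_or_placeholder` are hypothetical names, and the "dependency" is a plain function standing in for a real network call. The idea is to make a dependency fail on purpose and check that the caller degrades gracefully.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during a failure test."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_photo(photo_id):
    # Hypothetical dependency; stands in for a real network request.
    return f"photo-{photo_id}"


def fetch_photo_or_placeholder(fetch, photo_id):
    """Degrade gracefully: return a placeholder when the dependency fails."""
    try:
        return fetch(photo_id)
    except InjectedFailure:
        return "placeholder"


if __name__ == "__main__":
    flaky_fetch = with_fault_injection(fetch_photo, 0.5, random.Random(1))
    print([fetch_photo_or_placeholder(flaky_fetch, i) for i in range(10)])
```

Running the same wrapper in a test suite or a staging environment is one way to find out whether a fallback path works before a real outage exercises it.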
New Year - 2021
Diagram shown on every slide in this sequence: Provision Service, Network, Compute, and Monitoring (Logs, Metrics, Traces, Dashboards, Alerting), with three status indicators for the core services. The slides step through the incident:
View of core services. Status: OK.
Everything is normal (APAC -> EMEA). Status: OK.
Packet loss commences, first alerts. Network: packet loss. Status: OK.
We lose our visibility. Network: packet loss. Status: OK.
Minor downscale (CPU). Network: packet loss. Status: OK.
Massive upscale (Apache thread utilization). Network: packet loss. Status: slow.
Provision Service is saturated (limits). Network: packet loss. Status: down.
We fix the provision service. Network: packet loss. Status: slow, recovering.
Enough web capacity to serve Slack. Network: packet loss. Status: OK.
AWS scales the Transit GW. Everything back to normal. Status: OK.
Blog post
All Hands on Deck
Blog post
Slack outage
January 4th
Learning from
incidents
Incident reviews
Understand the timeline
Identify contributing factors
Good, bad and how we got lucky
Follow up items
Share lessons learned and key
patterns with the rest of the org
Learning from
incidents
Long term fixes that were put in place
Adjusted scaling policies
Fully re-designed provision service
Worked with AWS to improve
Transit GW auto-scaling
Expedited our migration to the new
network architecture
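The timeline above includes a massive upscale that saturated the provision service. The sketch below shows one way a scaling policy could bound how far capacity moves in a single evaluation. It is an assumption for illustration only, not Slack's actual policy: `next_capacity` and all of its parameters are hypothetical.

```python
def next_capacity(current, desired, max_step, min_capacity, max_capacity):
    """Move toward the desired capacity, but by at most max_step per evaluation."""
    # Keep the target inside the allowed fleet size first.
    desired = max(min_capacity, min(max_capacity, desired))
    delta = desired - current
    # Clamp the change in either direction so one noisy signal cannot
    # trigger a huge upscale or downscale in a single step.
    if delta > max_step:
        delta = max_step
    elif delta < -max_step:
        delta = -max_step
    return current + delta


if __name__ == "__main__":
    # A request to jump from 100 to 1300 instances is spread over several steps.
    print(next_capacity(100, 1300, max_step=50, min_capacity=10, max_capacity=2000))
```

The trade-off is slower reaction to genuine load spikes in exchange for protecting whatever system provisions the new instances.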
Learn more from other Slack incidents
Slack’s Incident on 2-22-22
A Terrible, Horrible, No-Good,
Very Bad Day at Slack
Chaos Engineering Tools
Other lessons learned
Monzo’s Kube Outage (2017)
Presentation
Canva’s Recommender system (2021)
Mayur & Thien’s
Blog post
amazon.com Black Friday outage
Article
Congestion collapse - when retries are bad
re:Invent 2023 - Surviving overloads (NET402)
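Retries become harmful when every client retries at once against a service that is already overloaded. Two common mitigations are exponential backoff with jitter and a retry budget. The sketch below is a generic illustration, not code from the referenced session: `RetryBudget`, `backoff_delay`, `call_with_retries`, and the default values are all hypothetical.

```python
import random
import time


class RetryBudget:
    """Token bucket: each request earns a fraction of a retry, each retry spends one."""

    def __init__(self, ratio, max_tokens):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens

    def record_request(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def backoff_delay(attempt, base, cap, rng):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, budget, max_attempts=3, base=0.1, cap=2.0,
                      rng=random, sleep=time.sleep):
    """Call operation, retrying on failure only while the budget allows it."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            # Give up when attempts are exhausted or the budget is empty,
            # so a struggling service is not hit with even more traffic.
            if last_attempt or not budget.try_spend():
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

With `ratio=0.1`, retries are limited to roughly 10% of request volume once the initial tokens are spent, which caps the extra load retries can add during an overload.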
Multi-Region is a hard problem for writes
AWS Multi-Region Fundamentals
Global services, regional services and zonal services
AWS Fault Isolation Boundaries
Maybe we just need to know our fault boundaries?
AWS Fault Isolation Boundaries
Original Multi-AZ architecture at Slack
Failures within one AZ could affect the
whole system.
Gray failures paper
Cellular Architecture at Slack
AZs as cells, and cells may be drained
Blog post
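The idea of treating each AZ as a cell that may be drained can be sketched as a weighted router. This is a simplified illustration under stated assumptions, not Slack's implementation: `CellRouter` and the cell names are hypothetical, and real systems usually shift weights gradually at the load-balancing layer.

```python
import random


class CellRouter:
    """Routes requests across cells (here, one cell per AZ) by weight."""

    def __init__(self, cells):
        self.weights = {cell: 1.0 for cell in cells}

    def drain(self, cell):
        # Stop sending new traffic to a cell suspected of failing.
        self.weights[cell] = 0.0

    def undrain(self, cell):
        self.weights[cell] = 1.0

    def pick(self, rng=random):
        """Pick a cell for the next request, skipping drained cells."""
        active = [c for c, w in self.weights.items() if w > 0]
        if not active:
            raise RuntimeError("all cells are drained")
        weights = [self.weights[c] for c in active]
        return rng.choices(active, weights=weights, k=1)[0]


if __name__ == "__main__":
    router = CellRouter(["az-a", "az-b", "az-c"])  # hypothetical cell names
    router.drain("az-b")
    print([router.pick() for _ in range(5)])
```

Draining is useful for gray failures in particular: the operator does not need to diagnose the fault first, only to notice that one cell looks unhealthy and move traffic away from it.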
Things break all the time
Some patterns to build more resilient
systems in the Cloud
A learning culture is key
Popular ways to mitigate
Practice your incident management
processes
Share your incident lessons with the
Community
Summary
Some takeaways to
weather the storm
Ice breaker
Hey, what’s a
memorable incident
you’ve been part of or
read about?
Thank you
for listening!
Feel free to DM us your
questions any time:
@setoide
@mysqldbahelp