Failure Engineering - Architecting Resilient APIs

akisaxena 500 views 11 slides Jul 26, 2024

About This Presentation

How to build more fault-tolerant APIs that withstand scale, security, and maintainability challenges


Slide Content

Failure Engineering - API Edition





Requires first-principles thinking
Simplicity at its core
Paranoia is good
Tool - Imagine an impossible scenario and hold your system up to that
Failure Engineering
Planning to fail, so that you react better during incidents

API - The Usual Stuff






Status Codes
Latency
Resource utilisation - only because your CFO makes you :)
Scaling Ladders
Performance Testing
Endurance Testing




Missing the forest for the trees - Focusing on each service, not the whole system
Serviceability, not uptime
Some services can die
What do we miss usually?

Enough with the suspense!




Observability - Where is it smoking?
Causality with Topology - It's smoking here, but broken there
Lazy Origins - Do less at origin, create runoffs
Escape Hatches - What's your plan B, C, X?






Dashboard - Visualise your topology
Monitor - Know your breaking points & monitor them
Noise Management - Traffic patterns are temporal + seasonal
Alerting Maturity - Build a mature alert & incident response SOP.
Vendors - Vendors are part of your fabric as well, don't skip monitoring them
Observability
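
Because traffic is temporal and seasonal, a single static threshold tends to either page at every daily peak or miss a spike at a quiet hour. One possible way to manage that noise is to compare each reading with the history for the same weekday and hour. This is only a minimal sketch: the names `build_baseline` and `is_anomalous`, the (weekday, hour) bucketing, and the `k = 3` multiplier are assumptions for illustration, not something taken from the slides.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def build_baseline(samples):
    """Group (timestamp, value) samples by (weekday, hour).

    Returns a dict mapping each bucket to (mean, population std dev).
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}


def is_anomalous(baseline, ts, value, k=3.0):
    """True when value is more than k std devs above its bucket's mean."""
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        # No history for this bucket: stay quiet rather than guess.
        return False
    avg, std = baseline[key]
    return value > avg + k * std


# Hypothetical history: requests per second on three consecutive Mondays.
history = [
    (datetime(2024, 7, 1, 9), 100), (datetime(2024, 7, 8, 9), 110),
    (datetime(2024, 7, 15, 9), 90),
    (datetime(2024, 7, 1, 3), 10), (datetime(2024, 7, 8, 3), 12),
    (datetime(2024, 7, 15, 3), 8),
]
baseline = build_baseline(history)
```

With this baseline, 120 requests per second at 09:00 on a Monday stays within the normal band, while 100 requests per second at 03:00 would be flagged, even though it is lower in absolute terms.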

Protect your origin




Scale Selectively - Answer new questions only
Walled Garden - Only whitelisted requests, create a perimeter for security
Create Runoffs - Nobody needs to see your origin is down
Pre-Warm - Anticipate the wave
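
The walled-garden idea can be sketched as an explicit allowlist that is checked before a request is forwarded to the origin. The routes in `ALLOWLIST` and the helper name `is_allowed` are hypothetical; in practice a check like this would more likely live in a gateway or edge configuration than in application code.

```python
# Hypothetical allowlist: only these (method, path) pairs may reach the origin.
ALLOWLIST = {
    ("GET", "/v1/products"),
    ("GET", "/v1/health"),
    ("POST", "/v1/orders"),
}


def is_allowed(method, path, allowlist=ALLOWLIST):
    """Return True only for requests that are explicitly allowlisted."""
    # Drop the query string and any trailing slash before comparing.
    normalized = path.split("?", 1)[0].rstrip("/") or "/"
    return (method.upper(), normalized) in allowlist
```

Anything not on the list is rejected at the perimeter, so an unexpected method or path never costs the origin any work.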








Is it cacheable? - Then Yes
Leverage the middle tier
Leverage the edge device
CDN = Shock Absorption + Hooks to inject behaviours
Some high performance use-cases should skip CDN
Evaluate value of "drag" - Intermediate blocks allow for more "in-transit" decision making, but they also add some drag
Nothing is free - CDNs require tuning & monitoring too
Should you CDN?
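
The "is it cacheable?" question can be made concrete as a small rule that picks a `Cache-Control` header per response. This is a sketch under stated assumptions: the function name, the rule that personalised or non-GET/HEAD responses are never cached, and the `stale-if-error` window of ten times the TTL are all illustrative choices. `stale-if-error` is a standard directive, but support differs between CDNs, so check your provider before relying on it.

```python
def cache_control_for(method, is_personalised, ttl_seconds):
    """Pick a Cache-Control header value for a response (illustrative rule)."""
    cacheable = (
        method.upper() in ("GET", "HEAD")
        and not is_personalised
        and ttl_seconds > 0
    )
    if not cacheable:
        # Never let a shared cache store writes or per-user data.
        return "no-store"
    # Assumption: let caches serve stale content for 10x the TTL
    # while the origin is returning errors.
    return f"public, max-age={ttl_seconds}, stale-if-error={ttl_seconds * 10}"
```

For example, `cache_control_for("GET", False, 60)` yields `public, max-age=60, stale-if-error=600`, while any personalised response gets `no-store`.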

Danger - Here be dragons



Row-of-houses fires - Cooperating APIs fail, mid tiers fail
Dam Bursts - Complete failure coupled with untested "side-mitigations" brings tsunamis
Poor Eviction Strategies - Don't forget the basics
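
As a reminder of what "the basics" of eviction might look like, here is a minimal sketch of a cache that bounds both size (least recently used entry goes first) and age (a time-to-live per entry). The class name `TTLLRUCache` and its parameters are assumptions for illustration; a production cache would also need locking and metrics.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Small cache with LRU eviction and a per-entry time-to-live."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._data = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop it so stale data is never served.
            del self._data[key]
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict least recently used
```

Without the size bound the cache grows until it takes the process down; without the TTL it keeps serving answers that stopped being true.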





Delegate: Don't come to the origin, resolve known answers closer to the client. Move behaviors to the edge.
Facade: Create an escape hatch in case the origin has trouble
Principle: Don't answer questions for which answers have already been given
Bare Metal - Monitor as close to the metal as possible, monitor what matters

Patterns - To Take Away
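
The Facade pattern can be sketched as a thin wrapper that remembers the last good answer for each key and serves it when the origin fails. The class name `OriginFacade`, the `fetch_origin` callable, and the "fresh"/"stale" labels are hypothetical; this is one possible escape hatch, not the only one.

```python
class OriginFacade:
    """Wrap an origin call and fall back to the last known good answer."""

    def __init__(self, fetch_origin):
        self._fetch = fetch_origin  # callable: key -> value, may raise
        self._last_good = {}

    def get(self, key):
        """Return (value, source), where source is "fresh" or "stale"."""
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._last_good:
                # Escape hatch: the caller never sees that the origin is down.
                return self._last_good[key], "stale"
            raise  # nothing to fall back on, surface the failure
        self._last_good[key] = value
        return value, "fresh"
```

Returning the source alongside the value lets callers and dashboards see how often the escape hatch is in use, which is itself a signal worth monitoring.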

There is no replacement for rigour & discipline in production