stackconf 2024 | Bring Your Chaos to Work Day by Dionysios Tsoumas

NETWAYS · 24 slides · Jul 02, 2024

About This Presentation

Join us for an emulated Chaos at Work Day. We’ll take over our pre-production environment and run a series of chaos experiments to test the resilience of our systems. We’ll be using a combination of LinkerD, Chaos Mesh, and K6 fault injections to simulate real-world scenarios and see how our mon...


Slide Content

Bring Your Chaos to Work Day
Dionysis Tsoumas, Director of DevOps

Ok, let's do this!
01 Our stack
02 How did we get here?
03 Our tools
04 Tests and dashboards
05 Gamifying results

Bonus: What is chaos engineering?

Our stack
Monitoring & observability
●Prometheus operator
●Thanos
●Grafana
●Loki
●Tempo
●K6
●OpsGenie
The Rest
●Kubernetes clusters on GCP
●GCP monitoring for fallback
●Sentry
●LinkerD
●Chaos Mesh

How did we get here?
Is chaos engineering still relevant?

Almost real-life example (diagram): a user request involving the Login service, Bundles service, and Subscriptions service.

Bring your chaos to work: a 2-day event where we tried to bring down our staging and monitoring platforms. The possibilities were endless.

A few special words about LinkerD
https://linkerd.io/
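The experiments further down assume the workloads are meshed. With LinkerD that is a one-line opt-in; a minimal sketch that meshes a whole namespace (the namespace name is an assumption, not our actual setup):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging                     # illustrative namespace
  annotations:
    linkerd.io/inject: enabled      # LinkerD auto-injects its proxy into every pod created here
```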

And of course, Chaos Mesh
https://chaos-mesh.org/
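Chaos Mesh describes each experiment as a Kubernetes custom resource. A minimal sketch of the pod-failure experiment type used below (target namespace and labels are illustrative, not our actual manifests):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: login-pod-failure
  namespace: chaos-mesh
spec:
  action: pod-failure          # make the selected pod unavailable for the duration
  mode: one                    # pick one random pod matching the selector
  duration: "60s"
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: login-service       # hypothetical target workload
```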

Prep work: Let’s talk about tests

Chaos Mesh
●Stress Test
●Pod Failures
●Network Delay (sketched below)
LinkerD
●Fault injection (sketched below)
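The network-delay tests are another Chaos Mesh resource; a sketch with illustrative latency values and target labels:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bundles-network-delay
  namespace: chaos-mesh
spec:
  action: delay                # inject latency into traffic of the matching pods
  mode: all                    # apply to every pod matching the selector
  duration: "5m"
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: bundles-service     # hypothetical target workload
  delay:
    latency: "200ms"
    jitter: "50ms"
```

LinkerD has no fault-injection resource of its own; its documentation demonstrates fault injection by shifting a slice of traffic to a backend that always fails, for example with an SMI TrafficSplit. A sketch, where error-injector is a hypothetical Service that always returns HTTP 500s:

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: bundles-error-split
  namespace: staging
spec:
  service: bundles-service       # apex service that clients call
  backends:
    - service: bundles-service   # ~90% of requests reach the real backend
      weight: 900m
    - service: error-injector    # ~10% hit the always-failing stub
      weight: 100m
```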

Should we have a peek?
(Of course we should)

Prep work: Let’s talk about dashboards
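Most of what the dashboards show can be reduced to LinkerD's golden metrics. A sketch of the kind of Prometheus Operator recording rule a success-rate panel can sit on, assuming the proxy's standard response_total metric with its direction and classification labels (rule and namespace names are illustrative, not our actual config):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: linkerd-success-rate
  namespace: monitoring
spec:
  groups:
    - name: linkerd-success-rate.rules
      rules:
        # Share of inbound requests classified as successful, per namespace
        - record: namespace:linkerd_inbound_success_rate:ratio5m
          expr: |
            sum by (namespace) (rate(response_total{direction="inbound", classification="success"}[5m]))
              /
            sum by (namespace) (rate(response_total{direction="inbound"}[5m]))
```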

Our roles
●Chronicler – Keeping notes
●Responder – Keeping track of alerts
●(2x) Maestros – Running the experiments
●Firefighter – Handling communication
●Watcher – Monitoring for errors

Quick question:
Is it worth emulating users?
(Yes and no)

Ok, time for the results and some interesting pointers for next steps.

What did we learn?

Let’s play a game.

How many P1 alerts did we trigger?


Left: Between 0 and 5
Right: Between 6 and 15

We only got one (1) P1 alert
- Good old Prometheus (PrometheusInstanceIsDown)






Action items, depending on the reading:
- Either our P1 alerts are too flexible
- Or our tests were not aggressive enough

What about P2 vs P3?


Left: More P2
Right: More P3

We only got P3 alerts
- Triggered by LinkerD's ProxyInboundSuccessRate
(You know what this means)






Action items:
- At least for the monitoring namespaces, promote P3 -> P2/P1

What about P2?


Left: We didn't have any P2s
Right: Eventual consistency

Eventually we would get P2/P3 alerts
- Triggered at 20 min
- NotEnoughReplicas or OOM






Action items:
- Improve alerts on monitoring namespaces
- P3 at 15', P2 at 30' (sketched below)
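The 15-/30-minute tiering maps naturally onto two Prometheus Operator alerts that share one expression and differ only in for: and severity. A sketch, with a placeholder replica check standing in for the real rules (the metrics come from kube-state-metrics; alert names, thresholds and labels are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: monitoring-replicas-tiered
  namespace: monitoring
spec:
  groups:
    - name: monitoring-replicas.rules
      rules:
        # P3 once the condition has held for 15 minutes ...
        - alert: MonitoringNotEnoughReplicasP3
          expr: >
            kube_deployment_status_replicas_available{namespace="monitoring"}
              < kube_deployment_spec_replicas{namespace="monitoring"}
          for: 15m
          labels:
            severity: P3
        # ... escalating to P2 if it is still true at 30 minutes
        - alert: MonitoringNotEnoughReplicasP2
          expr: >
            kube_deployment_status_replicas_available{namespace="monitoring"}
              < kube_deployment_spec_replicas{namespace="monitoring"}
          for: 30m
          labels:
            severity: P2
```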

Our biggest finding?
A retrying mechanism would solve most of our problems and help a lot with blank screens. Also, we are not as cool as we thought.

●We introduced LinkerD's Retry fallback mechanism
●Retries, timeouts and retry budgets (sketched below)
●You can monitor them as well!
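A sketch of what that looks like on the LinkerD side: a ServiceProfile that marks an idempotent route as retryable, gives it a timeout, and caps the extra load with a retry budget (service name, route and numbers are illustrative, not our production values):

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # the name must be the FQDN of the Service being profiled
  name: bundles-service.staging.svc.cluster.local
  namespace: staging
spec:
  routes:
    - name: GET /bundles
      condition:
        method: GET
        pathRegex: /bundles
      isRetryable: true      # the proxy retries failed requests on this route
      timeout: 300ms         # per-request timeout
  retryBudget:
    retryRatio: 0.2          # retries may add at most 20% on top of regular traffic
    minRetriesPerSecond: 10
    ttl: 10s
```

The proxy then reports per-route metrics (route_response_total), so the effect of retries shows up on the same Grafana dashboards.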

Thanks for
participating!
Have more questions?
Reach me at [email protected]
LinkedIn: Dionysis Tsoumas