stackconf 2024 | Bring Your Chaos to Work Day by Dionysios Tsoumas

NETWAYS · 24 slides · Jul 02, 2024

About This Presentation

Join us for an emulated Chaos at Work Day. We’ll take over our pre-production environment and run a series of chaos experiments to test the resilience of our systems. We’ll be using a combination of LinkerD, Chaos Mesh, and K6 fault injections to simulate real-world scenarios and see how our mon...


Slide Content

Bring Your Chaos to Work Day
Dionysis Tsoumas, Director of DevOps

Ok, let's do this!
01 Our stack
02 How did we get here?
03 Our tools
04 Tests and dashboards
05 Gamifying results

Bonus: What is chaos engineering?

Our stack
Monitoring & observability
●Prometheus operator
●Thanos
●Grafana
●Loki
●Tempo
●K6
●OpsGenie
The Rest
●Kubernetes clusters on GCP
●GCP monitoring for fallback
●Sentry
●LinkerD
●Chaos Mesh

How did we get here?
Is chaos engineering still relevant?

Almost real-life example (diagram): a user request involving the Login service, Bundles service, and Subscriptions service.

Bring your chaos to work: a 2-day event where we tried to bring down our staging and monitoring platforms. The possibilities were endless.

A few special words about LinkerD
https://linkerd.io/
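The experiments further down assume the workloads are meshed. With LinkerD that is a one-line opt-in; a minimal sketch that meshes a whole namespace (the namespace name is an assumption, not our actual setup):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging                     # illustrative namespace
  annotations:
    linkerd.io/inject: enabled      # LinkerD auto-injects its proxy into every pod created here
```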

And of course, Chaos Mesh
https://chaos-mesh.org/
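Chaos Mesh describes each experiment as a Kubernetes custom resource. A minimal sketch of the pod-failure experiment type used below (target namespace and labels are illustrative, not our actual manifests):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: login-pod-failure
  namespace: chaos-mesh
spec:
  action: pod-failure          # make the selected pod unavailable for the duration
  mode: one                    # pick one random pod matching the selector
  duration: "60s"
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: login-service       # hypothetical target workload
```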

Prep work: Let’s talk about tests

Chaos Mesh
●Stress Test
●Pod Failures
●Network Delay (sketched below)
LinkerD
●Fault injection (sketched below)
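The network-delay tests are another Chaos Mesh resource; a sketch with illustrative latency values and target labels:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: bundles-network-delay
  namespace: chaos-mesh
spec:
  action: delay                # inject latency into traffic of the matching pods
  mode: all                    # apply to every pod matching the selector
  duration: "5m"
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: bundles-service     # hypothetical target workload
  delay:
    latency: "200ms"
    jitter: "50ms"
```

LinkerD has no fault-injection resource of its own; its documentation demonstrates fault injection by shifting a slice of traffic to a backend that always fails, for example with an SMI TrafficSplit. A sketch, where error-injector is a hypothetical Service that always returns HTTP 500s:

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: bundles-error-split
  namespace: staging
spec:
  service: bundles-service       # apex service that clients call
  backends:
    - service: bundles-service   # ~90% of requests reach the real backend
      weight: 900m
    - service: error-injector    # ~10% hit the always-failing stub
      weight: 100m
```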

Should we have a peek?
(Of course we should)

Prep work: Let’s talk about dashboards
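Most of what the dashboards show can be reduced to LinkerD's golden metrics. A sketch of the kind of Prometheus Operator recording rule a success-rate panel can sit on, assuming the proxy's standard response_total metric with its direction and classification labels (rule and namespace names are illustrative, not our actual config):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: linkerd-success-rate
  namespace: monitoring
spec:
  groups:
    - name: linkerd-success-rate.rules
      rules:
        # Share of inbound requests classified as successful, per namespace
        - record: namespace:linkerd_inbound_success_rate:ratio5m
          expr: |
            sum by (namespace) (rate(response_total{direction="inbound", classification="success"}[5m]))
              /
            sum by (namespace) (rate(response_total{direction="inbound"}[5m]))
```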

Our roles
●Chronicler – Keeping notes
●Responder – Keeping track of alerts
●(2x) Maestros – Running the experiments
●Firefighter – Handling communication
●Watcher – Monitoring for errors

Quick question:
Is it worth emulating users?
(Yes and no)

Ok, time for the results and some interesting pointers for next steps.

What did we learn?

Let’s play a game.

How many P1 alerts did we trigger?


Left: Between 0 and 5
Right: Between 6 and 15

We only got one (1) P1 alert
- Good old Prometheus (PrometheusInstanceIsDown)






Action items, depending on the reading:
- Either our P1 alerts are too flexible
- Or our tests were not aggressive enough

What about P2 vs P3?


Left: More P2
Right: More P3

We only got P3 alerts
- Triggered by LinkerD's ProxyInboundSuccessRate
(You know what this means)






Action items:
- At least for the monitoring namespaces, promote P3 -> P2/P1

What about P2?


Left: We didn't have any P2s
Right: Eventual consistency

Eventually we would get P2/P3 alerts
- Triggered at 20 min
- NotEnoughReplicas or OOM






Action items:
- Improve alerts on monitoring namespaces
- P3 at 15', P2 at 30' (sketched below)
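The 15-/30-minute tiering maps naturally onto two Prometheus Operator alerts that share one expression and differ only in for: and severity. A sketch, with a placeholder replica check standing in for the real rules (the metrics come from kube-state-metrics; alert names, thresholds and labels are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: monitoring-replicas-tiered
  namespace: monitoring
spec:
  groups:
    - name: monitoring-replicas.rules
      rules:
        # P3 once the condition has held for 15 minutes ...
        - alert: MonitoringNotEnoughReplicasP3
          expr: >
            kube_deployment_status_replicas_available{namespace="monitoring"}
              < kube_deployment_spec_replicas{namespace="monitoring"}
          for: 15m
          labels:
            severity: P3
        # ... escalating to P2 if it is still true at 30 minutes
        - alert: MonitoringNotEnoughReplicasP2
          expr: >
            kube_deployment_status_replicas_available{namespace="monitoring"}
              < kube_deployment_spec_replicas{namespace="monitoring"}
          for: 30m
          labels:
            severity: P2
```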

Our biggest finding?
A retrying mechanism would solve most of our problems and help a lot with blank screens. Also, we are not as cool as we thought.

●We introduced LinkerD's Retry fallback mechanism
●Retries, timeouts and retry budgets (sketched below)
●You can monitor them as well!
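A sketch of what that looks like on the LinkerD side: a ServiceProfile that marks an idempotent route as retryable, gives it a timeout, and caps the extra load with a retry budget (service name, route and numbers are illustrative, not our production values):

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # the name must be the FQDN of the Service being profiled
  name: bundles-service.staging.svc.cluster.local
  namespace: staging
spec:
  routes:
    - name: GET /bundles
      condition:
        method: GET
        pathRegex: /bundles
      isRetryable: true      # the proxy retries failed requests on this route
      timeout: 300ms         # per-request timeout
  retryBudget:
    retryRatio: 0.2          # retries may add at most 20% on top of regular traffic
    minRetriesPerSecond: 10
    ttl: 10s
```

The proxy then reports per-route metrics (route_response_total), so the effect of retries shows up on the same Grafana dashboards.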

Thanks for
participating!
Have more questions?
Reach me at [email protected]
LinkedIn: Dionysis Tsoumas