You Can't Test Everything, But You Should Monitor It (OpenSearchCon)

michilehr, 43 slides, May 10, 2024

About This Presentation

We had an incident in our warehouse at KRUU. Photo downloads appeared to become very slow from one day to the next, or at least we thought the problem had started on that day.

In fact, the problem had started two years earlier, but we noticed it very late du...


Slide Content

YOU CAN’T TEST EVERYTHING
BUT YOU SHOULD MONITOR IT!
OpenSearchCon Europe 2024
Our journey to OpenSearch

Hi, I am Michi!

Head of Code at KRUU
@michilehr

What is KRUU doing?
"As Europe's leading photo booth provider, we have made it our mission to help our brides and grooms with their complete journey to their dream wedding. This is something we work tirelessly on with our team."
Philipp Schreiber - Co-Founder, KRUU.com

Photo Booth
Cycle

The Incident

Photo Booth
Cycle

Good
~12 MB/s

Bad
0.95 MB/s

??????

What could it be?
- Hardware error
- Bug in code
- Network
- OS

What could it be?
Transferring some sample data was fast

Investigate
1. When did it start?
2. Why did it happen?
3. How to prevent?
4. How to notice early?

When did it start?
We had data in our Slack Channel, but…

1. Write a script to extract the data as CSV
2. Import data to MySQL
3. Write query to aggregate by day
4. Create nice chart
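
The "aggregate by day" step above can be sketched in a few lines. This is only an illustration, not the script used at KRUU: it does the aggregation in plain Python instead of MySQL, and the column names `timestamp` and `mb_per_s` as well as the sample rows are assumptions made up for the example.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from statistics import mean


def average_speed_per_day(csv_text):
    """Return {date: mean MB/s} for CSV rows with timestamp and mb_per_s columns."""
    speeds_by_day = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.fromisoformat(row["timestamp"]).date()
        speeds_by_day[day].append(float(row["mb_per_s"]))
    # Sort by day so the result can be charted directly.
    return {day: mean(values) for day, values in sorted(speeds_by_day.items())}


# Made-up sample rows, only to show the expected shape of the input.
SAMPLE = """timestamp,mb_per_s
2024-01-01T09:00:00,12.0
2024-01-01T15:00:00,11.0
2024-01-02T09:00:00,1.0
"""

daily = average_speed_per_day(SAMPLE)
for day, speed in daily.items():
    print(day.isoformat(), round(speed, 2))
```

Plotting the resulting daily averages is what makes a slow, long-running degradation visible.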

Many hours later

It started a long time ago…

What happened?

First day at the new warehouse

Network configuration error

How to prevent or test?

How to notice early?

??????

Metrics
??????

But HOW to notice early?

Monitoring
and
Alerts!

Monitor: Data Source → Query → Trigger → Notification
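
To make the four building blocks concrete, here is a rough sketch of what a monitor definition for the download-speed case might look like. It only builds the request body as a Python dict and sends nothing. The field layout follows my understanding of the OpenSearch Alerting plugin's query-level monitor format and should be checked against the documentation for your version. The index name, the `@timestamp` and `mb_per_s` fields, the 5 MB/s threshold, and the notification channel ID are all hypothetical.

```python
import json


def build_speed_monitor(index, speed_field, min_mb_per_s, channel_id):
    """Build a query-level monitor body as a plain dict (nothing is sent)."""
    condition = (
        "ctx.results[0].aggregations.avg_speed.value != null && "
        f"ctx.results[0].aggregations.avg_speed.value < {min_mb_per_s}"
    )
    return {
        "type": "monitor",
        "name": "photo-download-speed",
        "monitor_type": "query_level_monitor",
        "enabled": True,
        "schedule": {"period": {"interval": 1, "unit": "HOURS"}},
        # Data source and query: average speed over the last hour.
        "inputs": [{
            "search": {
                "indices": [index],
                "query": {
                    "size": 0,
                    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
                    "aggs": {"avg_speed": {"avg": {"field": speed_field}}},
                },
            }
        }],
        # Trigger: fires when the average drops below the threshold.
        "triggers": [{
            "name": "download-speed-too-low",
            "severity": "1",
            "condition": {"script": {"lang": "painless", "source": condition}},
            # Notification: where the alert message goes.
            "actions": [{
                "name": "notify-team",
                "destination_id": channel_id,
                "subject_template": {"source": "Photo downloads are slow"},
                "message_template": {
                    "source": "Average download speed fell below the threshold."
                },
            }],
        }],
    }


# All argument values below are hypothetical placeholders.
monitor = build_speed_monitor("photo-downloads", "mb_per_s", 5, "hypothetical-channel-id")
print(json.dumps(monitor, indent=2))
```

The threshold sits between the "good" (~12 MB/s) and "bad" (0.95 MB/s) values from the incident; the right value in practice depends on how noisy the real data is.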

What else?
Online Photo Booths alert

What’s Next?
404 monitoring
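
The slides only name 404 monitoring as a next step, so the following is a guess at the underlying idea rather than the planned implementation: compute the share of 404 responses in a window of requests and flag it when it exceeds a threshold. The 5% threshold and the sample status codes are assumptions.

```python
from collections import Counter


def not_found_rate(status_codes):
    """Return the fraction of responses with status 404 (0.0 for no traffic)."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    return counts[404] / total if total else 0.0


def should_alert(status_codes, max_rate=0.05):
    """Return True when the 404 rate is above the (assumed) threshold."""
    return not_found_rate(status_codes) > max_rate


# Made-up status codes for one time window.
window = [200, 200, 404, 200]
print(not_found_rate(window), should_alert(window))
```

In a monitoring setup the same check would typically run as a query and trigger on indexed access logs rather than in application code.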

YOU CAN’T TEST EVERYTHING
BUT YOU SHOULD MONITOR IT!

Thank you for your time!

Questions?

Feedback?

Notes?