You Can't Test Everything, but You Should Monitor It (OpenSearchCon)
michilehr
43 slides
May 10, 2024
About This Presentation
We had an incident in our warehouse at KRUU. Photo downloads seemed to have become very slow from one day to the next, or so we thought.
In fact, the problem had started two years earlier. We only noticed it so late because rentals were reduced during Covid-19.
This will never happen again thanks to our new metrics and alerting powered by OpenSearch!
Slide Content
YOU CAN’T TEST EVERYTHING
BUT YOU SHOULD MONITOR IT!
OpenSearchCon Europe 2024
Our journey to OpenSearch
Hi, I am Michi!
Head of Code at KRUU
@michilehr
"As Europe's leading photo booth provider, we have
made it our mission to help our brides and grooms
with their complete journey to their dream wedding.
This is something we work tirelessly on with our
team."
What is KRUU doing?
Philipp Schreiber - Co-Founder, KRUU.com
Photo Booth Cycle
The Incident
Photo Booth Cycle
Good: ~12 MB/s
Bad: 0.95 MB/s
What could it be?
- Hardware error
- Bug in code
- Network
- OS
What could it be?
Transferring some sample data was fast
Investigate
1. When did it start?
2. Why did it happen?
3. How to prevent?
4. How to notice early?
When did it start?
We had data in our Slack channel, but…
1. Write a script to extract the data as CSV
2. Import the data into MySQL
3. Write a query to aggregate by day
4. Create a nice chart
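The extract, import, and aggregate steps above can be sketched roughly as follows. This is only an illustration, not the script used at KRUU: the CSV columns (timestamp, mb_per_s) and the sample rows are made up, and Python's built-in sqlite3 module stands in for MySQL so the example runs without a database server.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export: one row per photo download reported in Slack.
# Column names and values are invented for this sketch.
SAMPLE_CSV = """timestamp,mb_per_s
2024-01-01T10:00:00,12.1
2024-01-01T15:30:00,11.9
2024-01-02T09:15:00,0.9
2024-01-02T18:45:00,1.0
"""


def average_speed_per_day(csv_text):
    """Load the CSV into a table and return (day, average MB/s) rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["timestamp"], float(r["mb_per_s"])) for r in reader]

    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
    try:
        conn.execute("CREATE TABLE downloads (ts TEXT, mb_per_s REAL)")
        conn.executemany("INSERT INTO downloads VALUES (?, ?)", rows)
        # The first 10 characters of an ISO timestamp are the date.
        cur = conn.execute(
            "SELECT substr(ts, 1, 10) AS day, AVG(mb_per_s) "
            "FROM downloads GROUP BY day ORDER BY day"
        )
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for day, avg in average_speed_per_day(SAMPLE_CSV):
        print(f"{day}: {avg:.2f} MB/s")
```

The per-day averages are what the chart in step 4 would plot. A drop like the one in the sample data shows up as a clear step in such a chart.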
Many hours later
It started a long time ago…
What happened?
First day at the new warehouse
Network configuration error
How to prevent or test?
How to notice early?
Metrics
But HOW to notice early?
Monitoring and Alerts!
Monitor: Data Source, Query, Trigger, Notification
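To make these building blocks concrete, here is a sketch of what a monitor definition for slow photo downloads might look like, built as a Python dictionary. It is not the monitor shown in the talk. The index name, the field names (@timestamp, mb_per_s), the 5 MB/s threshold, and the destination ID are all hypothetical. The overall shape follows the query-level monitor format of the OpenSearch Alerting plugin as documented, and should be checked against the plugin version in use.

```python
import json


def build_speed_monitor(index, destination_id, min_mb_per_s=5.0):
    """Return a monitor body that alerts when the average speed drops.

    All names and thresholds are illustrative assumptions.
    """
    # Query: average download speed over the last hour (hypothetical fields).
    query = {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
        "aggs": {"avg_speed": {"avg": {"field": "mb_per_s"}}},
    }
    # Trigger: fire only if there is data and the average is too low.
    condition = (
        "ctx.results[0].aggregations.avg_speed.value != null && "
        f"ctx.results[0].aggregations.avg_speed.value < {min_mb_per_s}"
    )
    return {
        "type": "monitor",
        "name": "photo-download-speed",
        "monitor_type": "query_level_monitor",
        "enabled": True,
        "schedule": {"period": {"interval": 15, "unit": "MINUTES"}},
        # Data source and query
        "inputs": [{"search": {"indices": [index], "query": query}}],
        "triggers": [
            {
                "name": "download-speed-too-low",
                "severity": "1",
                "condition": {
                    "script": {"source": condition, "lang": "painless"}
                },
                # Notification
                "actions": [
                    {
                        "name": "notify-team",
                        "destination_id": destination_id,
                        "subject_template": {"source": "Slow photo downloads"},
                        "message_template": {
                            "source": "Average download speed is below the threshold."
                        },
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    monitor = build_speed_monitor("photo-downloads", "my-destination-id")
    print(json.dumps(monitor, indent=2))
```

The printed JSON could be sent to the Alerting plugin's create-monitor endpoint (POST _plugins/_alerting/monitors), or the same settings could be entered through the Alerting UI in OpenSearch Dashboards.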
Online Photo Booths alert
What else?
404 monitoring
What’s Next?
YOU CAN’T TEST EVERYTHING
BUT YOU SHOULD MONITOR IT!