"What does it really mean for your system to be available, or how to define what to measure", Daniil Mazepin
fwdays
21 slides
Jun 18, 2024
About This Presentation
We will talk about system monitoring from a few different angles. We will start by covering the basics, then discuss SLOs, how to define them, and why understanding the business well is crucial for success in this exercise.
Slide Content
Agenda
- Who is Daniil?
- Terminology
- Why SLOs?
- What to measure?
Who is Daniil?
Software Engineering Manager / Head of Engineering with over 13 years of experience at companies of varying sizes and stages of maturity, ranging from small start-ups to Facebook. His experience spans multiple domains, including fintech, social media, e-commerce, and gambling, utilising both top-down and bottom-up approaches.
Terminology: Reliability
The system or service performs in the expected way when it is required to do so.
Terminology: Service Level Indicator (SLI)
A quantifiable measure of service reliability.
Terminology: Service Level Objective (SLO)
A reliability target for an SLI.
Terminology: Service Level Agreement (SLA)
A contract (usually legally binding) between providers and customers specifying what happens if an SLO is not met.
Terminology: Error Budget
An SLO implies an acceptable level of unreliability:

100 - SLO = Error Budget for the next X days
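The formula above can be sketched as a small helper that turns an SLO target and a window into allowed downtime. This is a minimal illustration, not code from the talk; the function name is my own.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Allowed downtime (in minutes) over the window, per 100 - SLO = error budget."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60

# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(error_budget_minutes(99.9, 30))
```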
Terminology: SLO & Error Budget Windows

Fixed (calendar-based: per week, per month, per quarter, etc.):
- Works well for internal reporting purposes.
- Crucial for planning reliability work.

Rolling:
- More closely aligned with the user experience, because users' trust does not magically recover on the first day of each month.
Terminology: Burn Rate
The rate at which the allowed number of errors (the error budget for the next X days) is consumed. A burn rate below 1 means the budget will last beyond the window; a burn rate above 1 means it will be exhausted before the window ends.
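A burn rate can be sketched as the ratio of the observed error rate to the error rate the SLO allows. This is an illustrative calculation of my own, not from the slides.

```python
def burn_rate(errors: int, total: int, slo_percent: float) -> float:
    """How fast the error budget is being consumed; 1.0 = exactly on budget."""
    allowed_error_fraction = (100.0 - slo_percent) / 100.0
    observed_error_fraction = errors / total
    return observed_error_fraction / allowed_error_fraction

# With a 99.9% SLO, 10 errors in 1,000 requests burns the budget 10x too fast.
print(burn_rate(10, 1000, 99.9))
```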
Why SLOs?
But why would we invest in defining and measuring SLOs? To address the tension between the pace of innovation and service reliability.
What to measure? Choose your SLO

- A service level indicator (SLI): a metric of a specific aspect of your service.
- Duration: the window over which the SLI is measured. This can be calendar-based or a rolling window.
- A target: the value (or range of values) that the SLI should meet in the given duration in a healthy service.
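The three components above can be sketched as a small data structure; the field names and the example values here are illustrative assumptions, not from the talk.

```python
from dataclasses import dataclass

@dataclass
class Slo:
    sli: str               # which aspect of the service is measured
    window_days: int       # duration: the window over which the SLI is measured
    rolling: bool          # rolling window vs. a fixed calendar window
    target_percent: float  # the value the SLI should meet in a healthy service

checkout_availability = Slo(
    sli="fraction of valid requests served successfully",
    window_days=30,
    rolling=True,
    target_percent=99.9,
)
print(checkout_availability.target_percent)
```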
What to measure? Characteristics of a good metric

- The metric directly relates to user happiness.
- The metric's deterioration correlates with outages.
- The metric provides a good signal-to-noise ratio.
- The metric scales monotonically and linearly with customer happiness.
What to measure? Request-driven services

- Availability: the fraction of valid requests served successfully.
- Latency: the fraction of valid requests served faster than a threshold.
- Quality (*): the fraction of valid requests served without degradation of service.
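The availability and latency SLIs above can be computed from request samples. This is a minimal sketch of my own; the sample data and the 500 ms threshold are assumptions for illustration.

```python
# Each sample: (request succeeded, latency in milliseconds).
requests = [
    (True, 120), (True, 340), (False, 95),
    (True, 80), (True, 610), (True, 150),
]

valid = requests  # assume all requests here count as "valid" per the SLI definitions

availability = sum(ok for ok, _ in valid) / len(valid)
latency_sli = sum(ms < 500 for _, ms in valid) / len(valid)  # 500 ms threshold

print(availability)  # 5 of 6 requests succeeded
print(latency_sli)   # 5 of 6 requests were faster than 500 ms
```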
What to measure? Availability?.. Uptime?..

Availability still answers whether the system is up, but in a more precise way than measuring the time since the system was last down. Today, services might be partially down, a factor that uptime doesn't capture very well.
What to measure? Data processing services

- Coverage: the amount of data that has been processed, expressed as a fraction. For example, 95%.
- Correctness: the fraction of output data deemed to be correct. For example, 99.99%.
- Freshness: the freshness of the source data or aggregated output data, expressed as a fraction.
- Throughput: the fraction of time where the data processing rate was faster than a threshold.
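The coverage and freshness SLIs above can be sketched over a batch of records. The record shape and the 30-minute freshness threshold are my own illustrative assumptions, not from the slides.

```python
records = [
    {"processed": True,  "age_minutes": 5},
    {"processed": True,  "age_minutes": 45},
    {"processed": False, "age_minutes": 12},
    {"processed": True,  "age_minutes": 8},
]

# Coverage: fraction of records that have been processed.
coverage = sum(r["processed"] for r in records) / len(records)
# Freshness: fraction of records whose data is newer than the threshold.
freshness = sum(r["age_minutes"] <= 30 for r in records) / len(records)

print(coverage)   # 3 of 4 records processed
print(freshness)  # 3 of 4 records fresher than 30 minutes
```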
What to measure? How many 9s do we need?
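The question of "how many 9s" comes down to what each extra 9 buys. A quick illustration (my own, using the error-budget formula from earlier) of the downtime each common target allows over a 30-day window:

```python
for nines in (99.0, 99.9, 99.99, 99.999):
    # 100 - SLO = error budget, converted here into minutes over 30 days.
    budget_minutes = (100 - nines) / 100 * 30 * 24 * 60
    print(f"{nines}% -> {budget_minutes:.1f} minutes of downtime per 30 days")
```

Each additional 9 cuts the allowed downtime by a factor of ten, which is why the target should be driven by what users actually need rather than by chasing 9s.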
What to measure? Why not 100%?

"100% is the wrong reliability target for basically everything."
Ben Treynor Sloss, founder of SRE at Google
What to measure? Iterate!

"Picking the wrong number is better than picking no number."
from SRE.Google
What to measure? Iterate! x2

- Align dependencies.
- Build complex SLOs where it makes sense.