"What does it really mean for your system to be available, or how to define what to measure", Daniil Mazepin

fwdays 265 views 21 slides Jun 18, 2024

About This Presentation

We will talk about system monitoring from a few different angles. We will start by covering the basics, then discuss SLOs, how to define them, and why understanding the business well is crucial for success in this exercise.


Slide Content

Agenda
Who is Daniil?
Terminology
Why SLOs?
What to measure?

Who is Daniil?
Software Engineering Manager / Head of Engineering with over 13 years of experience at companies of varying sizes and stages of maturity, ranging from small start-ups to Facebook.
Experience spans multiple domains including fintech, social media, e-commerce, and gambling, utilising both top-down and bottom-up approaches.

Terminology
Reliability
The system or service performs in the expected way when it’s required to do so.

Terminology
Service Level Indicator (SLI)
A quantifiable measure of service reliability.

Terminology
Service Level Objective (SLO)
A reliability target for an SLI.

Terminology
Service Level Agreement (SLA)
A contract (usually legally binding) between a provider and its customers that specifies what happens if an SLO is not met.

Terminology
Error Budget
An SLO implies an acceptable level of unreliability.
100% - SLO = Error Budget for the next X days
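
The formula above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the SLO is expressed as a percentage of events (for example, requests) within one window; the function name `error_budget` and the sample traffic volume are illustrative and not taken from the talk.

```python
def error_budget(slo_percent: float, total_events: int) -> dict:
    """Return the error budget implied by an SLO over one window.

    slo_percent:  the SLO target as a percentage, e.g. 99.9
    total_events: expected number of events (e.g. requests) in the window
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    budget_percent = 100 - slo_percent  # 100% - SLO
    return {
        "budget_percent": budget_percent,
        "allowed_bad_events": total_events * budget_percent / 100,
    }


# Hypothetical example: a 99.9% SLO and one million requests in the window.
budget = error_budget(slo_percent=99.9, total_events=1_000_000)
print(
    f"{budget['budget_percent']:.3f}% of events may fail, "
    f"about {budget['allowed_bad_events']:.0f} events"
)
```

Expressing the budget as a count of events makes it easier to reason about how much of it a single incident consumed.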

Terminology
SLO & Error Budget Windows
Fixed
Calendar-based: per week, per month, per quarter, etc.
Works well for internal reporting purposes.
Crucial for planning reliability work.
Rolling
More closely aligned with the user experience, because users’ trust does not magically recover on the first day of each month.
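
The difference between the two window types shows up right after a calendar boundary. The sketch below is illustrative only: it assumes events are `(timestamp, ok)` pairs, a calendar-month fixed window, and a 30-day rolling window; all names and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def fixed_month_window(now: datetime) -> tuple:
    """Fixed window: from the first instant of the current month until now."""
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return start, now


def rolling_window(now: datetime, days: int = 30) -> tuple:
    """Rolling window: the last `days` days, regardless of the calendar."""
    return now - timedelta(days=days), now


def bad_events_in_window(events, start: datetime, end: datetime) -> int:
    """Count failed events whose timestamp falls inside [start, end]."""
    return sum(1 for ts, ok in events if start <= ts <= end and not ok)


# Hypothetical incident on the last day of January.
events = [
    (datetime(2024, 1, 31, 10, 0), False),
    (datetime(2024, 1, 31, 10, 5), False),
    (datetime(2024, 2, 1, 9, 0), True),
]
now = datetime(2024, 2, 1, 12, 0)

fixed_bad = bad_events_in_window(events, *fixed_month_window(now))
rolling_bad = bad_events_in_window(events, *rolling_window(now))
print(f"fixed window sees {fixed_bad} failures, rolling window sees {rolling_bad}")
```

On 1 February the fixed window has already forgotten the incident, while the rolling window still counts it, which is closer to what users remember.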

Terminology
Burn Rates
The rate at which the allowed number of errors (the error budget for the next X days) is consumed.
Burn Rate < 1: the error budget outlasts the window.
Burn Rate > 1: the error budget runs out before the window ends.
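
One common way to compute a burn rate is to divide the observed error rate by the error rate the SLO allows. The sketch below assumes that definition and an event-based SLO; the function names and sample numbers are illustrative, not from the talk.

```python
def burn_rate(bad_events: int, total_events: int, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A result of 1.0 means the budget is consumed exactly at the end of the
    window; above 1.0 it runs out early, below 1.0 some budget is left over.
    """
    allowed_error_rate = (100 - slo_percent) / 100
    if total_events <= 0 or allowed_error_rate <= 0:
        raise ValueError("need at least one event and an SLO below 100%")
    observed_error_rate = bad_events / total_events
    return observed_error_rate / allowed_error_rate


def days_until_exhausted(rate: float, window_days: int) -> float:
    """How long a full budget lasts if the current burn rate is sustained."""
    return float("inf") if rate == 0 else window_days / rate


# Hypothetical example: 50 failures out of 10,000 requests against a 99.9% SLO.
rate = burn_rate(bad_events=50, total_events=10_000, slo_percent=99.9)
print(f"burn rate {rate:.1f}, budget lasts {days_until_exhausted(rate, 30):.1f} days")
```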

Why SLOs?
But why would we invest in defining and measuring SLOs?
To address the tension between the pace of innovation and service reliability.

What to measure?
Choose your SLO
A service level indicator (SLI): A metric of a specific aspect of your service.
Duration: The window where the SLI is measured. This can be calendar-based or a rolling window.
A target: The value (or range of values) that the SLI should meet in the given duration in a healthy service.
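
The three ingredients above map naturally onto a small data structure. This is a minimal sketch; the class name `Slo`, its fields, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    sli: str          # which indicator is measured, e.g. "availability"
    window_days: int  # duration over which the SLI is evaluated
    rolling: bool     # True for a rolling window, False for calendar-based
    target: float     # value the SLI should meet, as a fraction (0.999 = 99.9%)

    def is_met(self, measured: float) -> bool:
        """Return True when the measured SLI meets or exceeds the target."""
        return measured >= self.target


# Hypothetical SLO: 99.9% availability over a rolling 28-day window.
checkout_availability = Slo(
    sli="availability", window_days=28, rolling=True, target=0.999
)
print(checkout_availability.is_met(0.9995))
print(checkout_availability.is_met(0.998))
```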

What to measure?
Characteristics of a good metric
The metric directly relates to user happiness.
The metric deterioration correlates with outages.
The metric provides a good signal-to-noise ratio.
The metric scales monotonically and linearly with customer happiness.

What to measure?
Request-driven services
Availability: The fraction of valid requests served successfully.
Latency: The fraction of valid requests served faster than a threshold.
Quality (*): The fraction of valid requests served without degradation of service.
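
The availability and latency definitions above can be sketched over a list of request records. The example assumes HTTP-style status codes, treats 4xx responses as not "valid" for SLI purposes, and treats 5xx as failures; those choices, the `Request` type, and the sample data are assumptions, not part of the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    status: int        # HTTP-style status code
    latency_ms: float  # time taken to serve the request


def valid_requests(requests: List[Request]) -> List[Request]:
    """Assumption: client errors (4xx) are excluded from the SLI."""
    return [r for r in requests if not 400 <= r.status < 500]


def availability(requests: List[Request]) -> Optional[float]:
    """Fraction of valid requests served successfully (no 5xx)."""
    valid = valid_requests(requests)
    if not valid:
        return None  # no valid traffic, so the SLI is undefined
    return sum(1 for r in valid if r.status < 500) / len(valid)


def latency_sli(requests: List[Request], threshold_ms: float) -> Optional[float]:
    """Fraction of valid requests served faster than the threshold."""
    valid = valid_requests(requests)
    if not valid:
        return None
    return sum(1 for r in valid if r.latency_ms < threshold_ms) / len(valid)


# Hypothetical sample of six requests.
sample = [
    Request(200, 120), Request(200, 480), Request(500, 30),
    Request(404, 10), Request(200, 90), Request(200, 350),
]
avail = availability(sample)
fast = latency_sli(sample, threshold_ms=300)
print(f"availability={avail:.2f}, latency SLI={fast:.2f}")
```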

What to measure?
Availability?.. Uptime?..
Availability still answers whether the system is up, but in a more precise way than measuring the time since the system was last down.
Services today might be only partially down, which uptime doesn’t capture very well.

What to measure?
Data processing services
Coverage: The amount of data that has been processed, expressed as a fraction. For example, 95%.
Correctness: The fraction of output data deemed to be correct. For example, 99.99%.
Freshness: How recently the source data or aggregated output data was updated, expressed as a fraction (for example, the fraction of data updated more recently than a threshold).
Throughput: The fraction of time where the data processing rate was faster than a threshold.
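
Each of these indicators reduces to "good units divided by total units". The sketch below illustrates coverage, freshness, and throughput under assumed inputs: record counts, record ages in minutes, and per-interval processing rates. All names, thresholds, and figures are hypothetical.

```python
from typing import Sequence


def fraction(good: int, total: int) -> float:
    """Shared helper: the share of good units among all units."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def coverage(records_processed: int, records_received: int) -> float:
    """Fraction of the data that has been processed."""
    return fraction(records_processed, records_received)


def freshness(ages_minutes: Sequence[float], max_age_minutes: float) -> float:
    """Fraction of records updated more recently than the threshold."""
    fresh = sum(1 for age in ages_minutes if age < max_age_minutes)
    return fraction(fresh, len(ages_minutes))


def throughput(rates_per_interval: Sequence[float], min_rate: float) -> float:
    """Fraction of time intervals where the processing rate beat the threshold."""
    fast = sum(1 for rate in rates_per_interval if rate > min_rate)
    return fraction(fast, len(rates_per_interval))


# Hypothetical pipeline measurements.
cov = coverage(records_processed=950, records_received=1000)
fresh = freshness([2, 5, 12, 1, 30], max_age_minutes=10)
thru = throughput([1200, 900, 1500, 1100], min_rate=1000)
print(f"coverage={cov:.2f}, freshness={fresh:.2f}, throughput={thru:.2f}")
```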

What to measure?
How many 9s do we need?
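
To get a feel for what each additional 9 costs, it helps to translate a target into allowed downtime. The sketch below is a simplification: it treats availability as time-based and assumes a 30-day window, whereas a request-based SLI would count failed requests instead. The function name is illustrative.

```python
def allowed_downtime_minutes(nines: int, window_days: int = 30) -> float:
    """Minutes of full downtime a target of `nines` nines allows per window.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # e.g. 0.001 for three nines
    return unavailability * window_days * 24 * 60


for n in range(2, 6):
    target = 100 * (1 - 10 ** -n)
    print(f"{target:.3f}% -> {allowed_downtime_minutes(n):.2f} minutes per 30 days")
```

Each extra 9 cuts the allowance by a factor of ten, which is why the right number depends on what the business and its users actually need.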

What to measure?
Why not 100%?
“100% is the wrong reliability target for basically everything.”
Ben Treynor Sloss, founder of SRE at Google

What to measure?
Iterate!
“Picking the wrong number is better than picking no number.”
from SRE.Google

What to measure?
Iterate! x2
Align dependencies.
Build complex SLOs where it makes sense.
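
One way to read "Align dependencies" is that a service cannot promise more than the services it depends on can deliver. The sketch below is a rough illustration under strong assumptions: every dependency is a hard, serial dependency and failures are independent, so availabilities multiply. The function names and figures are hypothetical.

```python
from math import prod
from typing import Sequence


def max_achievable_availability(dependency_slos: Sequence[float]) -> float:
    """Upper bound on availability when every dependency must succeed.

    Assumes independent failures and no retries, caches, or fallbacks.
    """
    return prod(dependency_slos)


def slo_is_aligned(target: float, dependency_slos: Sequence[float]) -> bool:
    """True when the target does not exceed what the dependencies allow."""
    return target <= max_achievable_availability(dependency_slos)


# Hypothetical dependencies: database, auth service, payment provider.
deps = [0.999, 0.9995, 0.9999]
ceiling = max_achievable_availability(deps)
print(f"ceiling={ceiling:.5f}, 99.9% target aligned: {slo_is_aligned(0.999, deps)}")
```

If the target exceeds the ceiling, the options are to lower the target, negotiate tighter dependency SLOs, or add fallbacks that soften the dependency.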

Thank you!
Do you have any questions?