Beginner's Guide to Observability @ Devoxx PL 2024

Slide Content

BEGINNER'S GUIDE TO OBSERVABILITY
Michał Niczyporuk
@mihn

ABOUT ME

WHAT IS OBSERVABILITY?

OBSERVABILITY
Americanized: Observability -> o11y

THE DEFINITION
“In control theory, observability is a
measure of how well internal states of a
system can be inferred from knowledge of
its external outputs”
Wikipedia

LET'S ASK THE ORACLE

WHAT FOR?
Way of determining state of the system
Observe trends
Spot anomalies
Debug errors
Gather data to support decision process
Measure user experience

LOGGING

LOGGING PRIME DIRECTIVE
Don't use System.out.println()

LOGGING LIBRARY ECOSYSTEM

DEFINE LOGGER
public class UserService {
    private static final org.slf4j.Logger LOGGER =
        org.slf4j.LoggerFactory.getLogger(UserService.class);
    // (...)
}

USE LOGGER
private int myMethod(int x) {
    LOGGER.info("running my great code with x={}", x);
    int y = methodX(x);
    LOGGER.info("my great code finished with y={}", y);
    return y;
}

2023-03-21 19:24:43,829 [main] INFO UserService - running my great code with x=...
2023-03-21 19:24:43,829 [main] INFO UserService - my great code finished with y=...

"TALKING TO A VOID" DEBUGGING
private void thisCallsMethodX() {
LOGGER.info("PLEASE WORK");
methodY();
LOGGER.info("SHOULD HAVE WORKED" );
}

DOS
Log meaningful checkpoints
Use correct logging levels
Add ids to the logging context, e.g. correlation id, user id (see the sketch below)
High-throughput = async appender
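A minimal SLF4J sketch of the context-id approach using MDC; the class, method, and id names here are made up for illustration:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderHandler {
    private static final Logger LOGGER = LoggerFactory.getLogger(OrderHandler.class);

    void handle(String correlationId, String userId) {
        // ids live in the per-thread logging context instead of in every message
        MDC.put("correlationId", correlationId);
        MDC.put("userId", userId);
        try {
            LOGGER.info("processing order"); // pattern %X{correlationId} prints the id
        } finally {
            MDC.clear(); // don't leak context to the next task on this thread
        }
    }
}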

DON'TS
Logs can be lost
Logs are not an audit trail
Logs are not metrics
Logs are not a data warehouse/lake
Alerts based on logs are flimsy - metrics-based alerts are better

WHAT NOT TO LOG
personal information, secrets, session tokens etc.
OWASP Logging Cheat Sheet
(thanks @piotrprz)

CENTRALIZED LOGS

CENTRAL STORAGE
Software that can ingest, process and search the logs

PUBLISHING LOGS

LOGS FORMAT
Unified format (e.g. JSON, logfmt, OpenTelemetry Protocol OTLP, vendor-specific)
regexes = two problems
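For illustration only, the earlier UserService line rendered as JSON; the exact field names depend on your encoder, and the values here are invented:

{"timestamp":"2023-03-21T19:24:43.829Z","level":"INFO","logger":"UserService","thread":"main","message":"running my great code with x=5","correlationId":"d6e8a6c0"}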

GROWTH RULE OF THUMB

TOOLING
Cloud providers:
AWS: Cloudwatch Logs
Azure: Monitor Logs
GCP: Cloud Logging
Elastic Stack = Elasticsearch + Logstash/Filebeat + Kibana
Graylog
Grafana Loki

METRICS

METRICS
numbers
indexed over time
and multiple additional dimensions
=== timeseries
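For example, a single sample in Prometheus exposition format: a metric name, label dimensions, and a value, indexed at scrape time (names and numbers are illustrative):

http_requests_total{method="GET",route="/users/:id",region="eu-west-1"} 1027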

STORAGE
timeseries database

PUBLISHING METRICS

FORMAT
Storage specific
Vendor neutral: OpenTelemetry

METRICS TYPES
Prometheus

COUNTER
Used for: requests count, jobs count, errors

GAUGE
Used for memory usage, thread/connection pool counts, etc.
Tip: Store the actual current and maximum values, not percentages

HISTOGRAM
used for request latencies and sizes
can calculate any percentiles via query
tip: preconfigured bucket sizes = better performance
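An illustrative Prometheus histogram exposition; buckets are fixed up front and cumulative (le = "less than or equal"), and any percentile can be approximated from them at query time (all numbers invented):

http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 310
http_request_duration_seconds_bucket{le="1"} 342
http_request_duration_seconds_bucket{le="+Inf"} 350
http_request_duration_seconds_sum 87.3
http_request_duration_seconds_count 350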

SUMMARY
Histogram - but with precomputed percentiles (e.g. p50, p75, p95)
Calculated client-side - lighter than histograms
Caveat: calculated per scrape target
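To make the types concrete, a minimal sketch with Micrometer (one of the libraries under TOOLS below); the registry choice and all metric names are illustrative:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MetricsExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // COUNTER: monotonically increasing, e.g. request count
        Counter requests = registry.counter("http.requests", "route", "/users/:id");
        requests.increment();

        // GAUGE: samples a current value, e.g. pool size (actual value, not a percentage)
        AtomicInteger activeConnections = registry.gauge("db.pool.active", new AtomicInteger(0));
        activeConnections.set(7);

        // TIMER: records durations; backed by a histogram or summary depending on config
        Timer latency = registry.timer("http.server.requests");
        latency.record(42, TimeUnit.MILLISECONDS);
    }
}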

METRIC ATTRIBUTES

METRIC RESOLUTION
Main dimension
How often metrics are probed
Long-term storage = downsampling/roll-up
1 min resolution up to 7 days -> 10 min up to 30 days, etc.
YMMV - think about business cycles (e.g. Black Friday)

METRIC TAGS/LABELS
Additional dimensions
Metadata: HTTP attributes, service name, hosts, cloud regions, etc.
Metric labels don't like high cardinality:
route="/users/:id" ✅
route="/users/2137" ❌

WHAT TO MEASURE?
Kirk Pepperdine's The Box model
Your framework/libraries cover some basics already

KIRK PEPPERDINE'S THE BOX
"People": incoming requests, messages, jobs...
Application (Netty, Tomcat, Spring, etc.): thread pools,
queue sizes, requests...
Runtime metrics (JVM): GC, memory usage, threads...
Hardware: CPU, memory, I/O, disk usage

HOW TO MEASURE?
The Four Golden Signals:
Latency
Traffic
Errors
Saturation

MEASUREMENT METHODS
RED vs USE

RED
Rate
Error
Duration
Application focused

USE
Utilization
Saturation
Errors
Infrastructure focused

ONE METRIC TO RULE THEM ALL
APDEX
https://www.apdex.org/
More user-experience focused - measures satisfaction
Single number

APDEX - SPLITTING THE POPULUS
Minimal sample size 100 - adjust time window
Measure as close to the user as possible
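The split itself, per the spec at apdex.org, given a chosen target response time T:

satisfied: response time <= T
tolerating: T < response time <= 4T
frustrated: response time > 4T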

APDEX - CALCULATIONS
Result is a decimal between 0 and 1
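The formula, from the apdex.org spec:

Apdex_T = (satisfied count + tolerating count / 2) / total samples

Worked example: 100 samples with 60 satisfied, 30 tolerating and 10 frustrated gives (60 + 30/2) / 100 = 0.75.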

APDEX - INTERPRETING THE RESULT
0.94 ≤ X ≤ 1 : excellent
0.85 ≤ X ≤ 0.93 : good
0.70 ≤ X ≤ 0.84 : fair
0.50 ≤ X ≤ 0.69 : poor
0.00 ≤ X ≤ 0.49 : unacceptable

APDEX - CAVEATS
Very generic metric - hides details
Should be monitored closely after deployments
Should measure one functionality
Shouldn't be the only metric of application success

PERCENTILES
Latency distribution is not normal/Gaussian
Use high percentiles (p99+) - not averages or medians (p50)

DIFFERENCE OF OPINION?

COORDINATED OMISSION
1. Server measures in the wrong place
2. Requests can be stuck waiting for processing
Measure both clients and servers - if possible
"How not to measure latency" by Gil Tene

DEATH BY METRICS
Storing unnecessary amounts of metrics
Made possible by automation
Bad for infrastructure and cloud bills
Bad for your mental health

WRONG ASSUMPTIONS ABOUT METRICS
1. Metrics can't be lost
2. Metrics are precise

GROWTH RULE OF THUMB

EXAMPLE - "MISSING" INDEX

TOOLS
Cloud providers:
AWS: CloudWatch Metrics
Azure: Monitor Metrics
GCP: Cloud Metrics
OSS:
Prometheus + Grafana
Thanos
VictoriaMetrics
Elastic Stack
Libraries:
Micrometer - API abstraction
OpenTelemetry

DISTRIBUTED TRACING

TRACING
(Diagram: a request from the Internet enters service A, which calls services B and C downstream; every span carries traceID = X)
span A - HTTP request processed (root span)
span B - HTTP request processed (parent span: A)
span C - HTTP request processed (parent span: B)
span D - db query executed (parent span: C)
span E - HTTP request processed (parent span: A)
span F - HTTP request processed (parent span: E)
span G - redis query executed (parent span: F)

ANATOMY OF A TRACE
Single trace = multiple spans
Each span contains:
Trace ID
Span ID
timestamp and duration
Parent span ID, if applicable
all the metadata, with any cardinality
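A minimal sketch of producing such a span with the OpenTelemetry Java API; the tracer name, span name, and attribute are made up:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracedService {
    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("user-service");

    void handleRequest() {
        Span span = TRACER.spanBuilder("handleRequest").startSpan(); // gets trace/span IDs
        try (Scope scope = span.makeCurrent()) {           // children pick this up as parent
            span.setAttribute("http.route", "/users/:id"); // metadata, any cardinality
            // ... business logic ...
        } finally {
            span.end(); // records the timestamp and duration
        }
    }
}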

STORAGE

PUBLISHING TRACES

TRACING IDS FORMAT
Propagated in headers (e.g. HTTP) or in metadata/attributes (e.g. Kafka)
Newer: W3C Trace Context
Older: Zipkin's B3
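For example, a W3C Trace Context header (the canonical example from the spec):

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

i.e. version (00), trace ID (32 hex chars), parent span ID (16 hex chars), flags (01 = sampled).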

TRACE PROPAGATION
Trace ID/Span ID propagation
exporting spans to storage

TRACE PROPAGATION IN SERVICES
your stack might already be doing that for you

TRACE PROPAGATION IN INFRASTRUCTURE

STORING TRACES

HEAD SAMPLING VS TAIL SAMPLING

HEAD SAMPLING
Decision made without looking at the whole trace
Random sample of traffic - the law of large numbers makes it viable
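For instance, OpenTelemetry SDKs can be switched to probabilistic head sampling through the standard environment variables (here keeping roughly 10% of traces):

OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.10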

TAIL SAMPLING
Decision made after considering all or most spans from a trace
Allows for custom behaviour

TAIL SAMPLING

THROUGHPUT-BASED TAIL SAMPLING
samples up to a defined number of spans per second

RULES-BASED TAIL SAMPLING
Storing traces matching custom policies
Example policies: keep spans with errors and slow traces; ignore healthchecks/metric endpoints/websockets, etc. (see the sketch below)
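A hedged sketch of such policies with the OpenTelemetry Collector's tail_sampling processor, combining the throughput-based and rules-based ideas; verify the field names against your collector version:

processors:
  tail_sampling:
    decision_wait: 10s              # wait for a trace's spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-traces
        type: latency
        latency: {threshold_ms: 2000}
      - name: cap-throughput
        type: rate_limiting
        rate_limiting: {spans_per_second: 100}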

DYNAMIC TAIL SAMPLING
Aims to have a representation of span attribute values
among collected traces in a timeframe
example: store 1 in 100 traces, but with representation
of values in attributes http.status_code,
http.method, http.route and service.name

MIXING SAMPLING METHODS
Sampling methods can and should be mixed

WHERE DOES SAMPLING HAPPEN?

TYPICAL PROBLEMS IN TRACING

MISSING INSTRUMENTATION
investigation needed
add missing instrumentation
wrap with custom span

MISSING SPANS
tools are smart enough to detect that
buffers for spans are too small
...or you hit a rate limiter somewhere

GROWTH RULE OF THUMB

EXAMPLE

LAST TIP
Put trace and span IDs in logs
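Most instrumentation does this automatically; a manual sketch with the OpenTelemetry API and SLF4J's MDC (MDC key names vary by library, so treat these as assumptions):

import io.opentelemetry.api.trace.Span;
import org.slf4j.MDC;

public class TraceIdsInLogs {
    static void enrichMdc() {
        Span current = Span.current(); // the currently active span, if any
        MDC.put("traceId", current.getSpanContext().getTraceId());
        MDC.put("spanId", current.getSpanContext().getSpanId());
        // the log pattern can then include %X{traceId} %X{spanId}
    }
}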

TOOLS
Cloud providers:
AWS: X-Ray
Azure: Azure Monitor Logs
GCP: Cloud Trace
Storage and visualization:
Zipkin
Jaeger
Abstraction API:
OpenTelemetry
Micrometer Tracing

WHAT IS OPENTELEMETRY
vendor-neutral observability framework

EVENTS
aka "true" observability
aka Observability 2.0

WHAT ARE EVENTS?
Events == spans

WHAT'S DIFFERENT?
Events are the fuel
Columnar storage and a query engine are the superpower
They give the ability to slice and dice data and discover unknown unknowns

UNKNOWN UNKNOWNS
Things we don't know we don't know
Example: Migrating to OpenTelemetry and from logs to
traces

INSPIRATION
Scuba: Diving into Data at Facebook

THE O.G

How to Avoid Paying for Honeycomb

HONEYCOMB UI

THE NEW KID
Very comprehensive observability package
Distributed tracing module needs ClickHouse

COROOT DISTRIBUTED TRACING UI

DEBUGGING WITH HONEYCOMB
Based on Sandbox on their website

HEATMAP

SELECTION

BUBBLEUP

BUBBLEUP - BASELINE VS SELECTION

BUBBLEUP - ENHANCE!

BUBBLEUP - POSSIBILITIES

BUBBLEUP - CHOSEN ONE

BUBBLEUP - PARAMETER VALUE

BUBBLEUP - FILTERED HEATMAP

BUBBLEUP - SELECTING TRACE

CULPRIT TRACE

CULPRIT TRACE - ZOOM

CULPRIT TRACE - ZOOM

CULPRIT TRACE - SIMILAR SPANS

CULPRIT TRACE - SIMILAR SPANS

CULPRIT TRACE - DEPLOYMENT

CULPRIT TRACE - DEPLOYMENT

THINGS TO REMEMBER
Observability = iterative & continuous process

WHAT AFTER?
SLI and SLO
Alerting and on-call

THANK YOU!
QUESTIONS?