WHAT IS
OBSERVABILITY?
THE DEFINITION
“In control theory, observability is a
measure of how well internal states of a
system can be inferred from knowledge of
its external outputs”
Wikipedia
LET'S ASK THE ORACLE
WHAT FOR?
Way of determining state of the system
Observe trends
Spot anomalies
Debug errors
Gather data to support decision process
Measure user experience
LOGGING PRIME DIRECTIVE
Don't use System.out.println()
LOGGING LIBRARY ECOSYSTEM
DEFINE LOGGER
public class UserService {
private static final org.slf4j.Logger LOGGER =
org.slf4j.LoggerFactory.getLogger(UserService.class);
(...)
}
USE LOGGER
private int myMethod(int x) {
LOGGER.info("running my great code with x={}" , x);
int y = methodX(x);
LOGGER.info("my great code finished with y={}" , y);
return y;
}
2023-03-21 19:24:43,829 [main] INFO UserService - running my gre
2023-03-21 19:24:43,829 [main] INFO UserService - my great code
"TALKING TO A VOID" DEBUGGING
private void thisCallsMethodX() {
LOGGER.info("PLEASE WORK");
methodY();
LOGGER.info("SHOULD HAVE WORKED" );
}
DOS
Log meaningful checkpoints
Use correct logging levels
Add ids to the logging context,
e.g. correlation id, user id (see the MDC sketch below)
High throughput = async appender
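A minimal sketch of putting a correlation id into the logging context with SLF4J's MDC (the filter, header name and key are illustrative assumptions, not from the slides; jakarta.servlet may be javax.servlet on older stacks):

import jakarta.servlet.*;
import jakarta.servlet.http.HttpServletRequest;
import org.slf4j.MDC;
import java.io.IOException;
import java.util.UUID;

public class CorrelationIdFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        // Reuse the caller's id if present, otherwise mint one (header name is an assumption)
        String correlationId = ((HttpServletRequest) req).getHeader("X-Correlation-Id");
        if (correlationId == null) {
            correlationId = UUID.randomUUID().toString();
        }
        MDC.put("correlationId", correlationId);   // every log line in this request now carries the id
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("correlationId");           // don't leak the id to the next request on this thread
        }
    }
}

With %X{correlationId} in the appender's pattern, the id shows up in every log line for that request.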
DON'TS
Logs can be lost
Logs are not audit trail
Logs are not metrics
Logs are not data warehouse/lake
Alerts based on logs are flimsy - metrics-based alerts
are better
WHAT NOT TO LOG
Personal information, secrets, session tokens, etc.
OWASP Logging Cheat Sheet
(thanks @piotrprz)
METRICS
numbers
indexed over time
and multiple additional dimensions
=== timeseries
STORAGE
timeseries database
PUBLISHING METRICS
FORMAT
Storage specific
Vendor neutral: OpenTelemetry
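For illustration, two samples in the Prometheus text exposition format (metric names, labels and values are made up): a name, labels as extra dimensions, and the current value:

http_requests_total{method="GET",route="/users/:id",status="200"} 1027
jvm_memory_used_bytes{area="heap"} 134217728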
METRICS TYPES
Prometheus
COUNTER
Used for: requests count, jobs count, errors
GAUGE
Used for memory usage, thread/connection pools
count, etc.
Tip: store the actual current and maximum values, not
percentages
HISTOGRAM
used for request latencies and sizes
can calculate any percentiles via query
tip: preconfigured bucket sizes = better performance
SUMMARY
Histogram - but with percentiles (e.g. p50, p75, p95)
Calculated client side - lighter than histograms
Caveat: calculated per scrape target
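A minimal Micrometer sketch of the metric types above (meter names, tags and values are illustrative, not from the slides):

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.util.concurrent.atomic.AtomicInteger;

MeterRegistry registry = new SimpleMeterRegistry();

// Counter: requests, jobs, errors
Counter requests = Counter.builder("http_requests_total")
        .tag("route", "/users/:id")              // low-cardinality label
        .register(registry);
requests.increment();

// Gauge: current value sampled on read, e.g. connection pool usage (store values, not percentages)
AtomicInteger activeConnections = new AtomicInteger(0);
Gauge.builder("db_pool_connections_active", activeConnections, AtomicInteger::get)
        .register(registry);

// Timer backed by a histogram: latencies, percentiles computable at query time
Timer latency = Timer.builder("http_request_duration")
        .publishPercentileHistogram()            // pre-configured buckets
        .register(registry);
latency.record(() -> { /* handle the request */ });

// Summary-style: client-side percentiles, lighter but computed per instance
DistributionSummary responseSize = DistributionSummary.builder("http_response_size_bytes")
        .publishPercentiles(0.5, 0.95, 0.99)
        .register(registry);
responseSize.record(2048);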
METRIC ATTRIBUTES
METRIC RESOLUTION
Main dimension
How often metrics are probed
Long-term storage = downsampling/roll-up
1 min up to 7 days -> 10 min up to 30 days, etc.
YMMV - think about business cycles (e.g. Black Friday)
METRIC TAGS/LABELS
Additional dimensions
Metadata: HTTP attributes, service name, hosts, cloud
regions, etc.
Metrics labels don't like high cardinality
route="/users/:id" ✅
route="/users/2137" ❌
WHAT TO MEASURE?
Kirk Pepperdine's The Box model
Your framework/libraries cover some basics already
HOW TO MEASURE?
The Four Golden Signals:
Latency
Traffic
Errors
Saturation
MEASUREMENT METHODS
RED vs USE
RED
Rate
Errors
Duration
Application focused
USE
Utilization
Saturation
Errors
Infrastructure focused
ONE METRIC TO RULE THEM ALL
APDEX
https://www.apdex.org/
More user-experience focused - measures satisfaction
Single number
APDEX - SPLITTING THE POPULUS
Minimum sample size of 100 - adjust the time window
Measure as close to the user as possible
APDEX - CALCULATIONS
Result is a decimal between 0 and 1
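Per apdex.org, for a target threshold T:

Apdex_T = (satisfied + tolerating / 2) / total samples

where "satisfied" means response time ≤ T, "tolerating" means ≤ 4T, and anything slower counts as frustrated. A made-up example: 100 samples with 80 satisfied, 10 tolerating and 10 frustrated gives (80 + 10/2) / 100 = 0.85.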
APDEX - INTERPRETING THE RESULT
0.94 ≤ X ≤ 1 : excellent
0.85 ≤ X ≤ 0.93 : good
0.70 ≤ X ≤ 0.84 : fair
0.50 ≤ X ≤ 0.69 : poor
0.00 ≤ X ≤ 0.49 : unacceptable
APDEX - CAVEATS
Very generic metric - hides details
Should be monitored closely after deployments
Should measure a single piece of functionality
Shouldn't be the only metric of application success
PERCENTILES
Latency distribution is not normal/Gaussian
Use high percentiles (p99+) - not averages/means or the
median (p50)
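A made-up example: 100 requests where 95 take 10 ms and 5 take 2000 ms. The mean is 109.5 ms (a latency no real request had), p50 is 10 ms (hides the tail entirely), and p99 is 2000 ms (what the slowest users actually experience).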
DIFFERENCE OF OPINION?
COORDINATED OMISSION
1. Server measures in the wrong place
2. Requests can be stuck waiting for processing
Measure both clients and servers - if possible
"How not to measure latency" by Gil Tene
DEATH BY METRICS
Storing an unnecessary amount of metrics
Made possible by automation
Bad for infrastructure and cloud bills
Bad for your mental health
WRONG ASSUMPTIONS ABOUT METRICS
1. Metrics can't be lost
2. Metrics are precise
TRACING
[Diagram: a request from the Internet flows through services A, B and C; every span carries traceID = X. Root span A (HTTP request processed, service A); span B (parent A) and span C (parent B) for the downstream HTTP calls; span D (parent C) for a db query; span E (parent A) and span F (parent E) for another call path; span G (parent F) for a redis query.]
ANATOMY OF A TRACE
Single trace = multiple spans
Each span contains
Trace ID
Span ID
timestamp and duration
Parent span ID, if applicable
All the metadata with any
cardinality
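A minimal sketch of creating one such span by hand with the OpenTelemetry Java API (scope name, span name and attribute are illustrative):

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

Tracer tracer = GlobalOpenTelemetry.getTracer("user-service");

Span span = tracer.spanBuilder("GET /users/:id").startSpan();   // gets a trace ID, span ID and start timestamp
try (Scope ignored = span.makeCurrent()) {                      // spans started inside this scope get this one as parent
    span.setAttribute("user.id", "2137");                       // metadata - any cardinality is fine here
    // ... do the actual work ...
} catch (Exception e) {
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    throw e;
} finally {
    span.end();                                                  // duration = end - start
}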
STORAGE
PUBLISHING TRACES
TRACING IDS FORMAT
Propagated in headers (e.g. HTTP) or in
metadata/attributes (e.g. Kafka)
Newer: W3C Trace Context
Older: Zipkin's B3
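For reference, a concrete W3C Trace Context header (the trace id and span id below are the spec's own sample values): version, trace id, parent span id, flags:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01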
TRACE PROPAGATION
Trace ID/Span ID propagation
exporting spans to storage
TRACE PROPAGATION IN SERVICES
your stack might already be doing that for you
TRACE PROPAGATION IN INFRASTRUCTURE
STORING TRACES
HEAD SAMPLING VS TAIL SAMPLING
HEAD SAMPLING
Decision made without looking at whole trace
Random sample of traffic - the law of large numbers makes
it viable
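A minimal OpenTelemetry Java SDK sketch of head sampling (the 10% ratio is an arbitrary example):

import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

// Decide at the root span, reuse the parent's decision downstream; keep ~10% of traces at random
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.10)))
        .build();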
TAIL SAMPLING
Decision made after considering all or most spans from
a trace
Allows for custom behaviour
TAIL SAMPLING
THROUGHPUT-BASED TAIL SAMPLING
Samples up to a defined number of spans per second
RULES-BASED TAIL SAMPLING
Storing traces matching custom policies
custom policies: spans with errors, slow traces, ignore
healthchecks/metric endpoints/websockets, etc.
DYNAMIC TAIL SAMPLING
Aims to have a representation of span attribute values
among collected traces in a timeframe
example: store 1 in 100 traces, but with representation
of values in attributes http.status_code,
http.method, http.route and service.name
MIXING SAMPLING METHODS
Sampling methods can and should be mixed
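One common way to mix policies is the tail_sampling processor in the OpenTelemetry Collector (contrib); a rough sketch, with policy names and thresholds chosen purely for illustration:

processors:
  tail_sampling:
    decision_wait: 10s                        # buffer spans until the trace is (probably) complete
    policies:
      - name: keep-errors                     # rules-based: always keep traces with errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow                       # rules-based: always keep slow traces
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline                        # random 1% baseline of everything else
        type: probabilistic
        probabilistic: {sampling_percentage: 1}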
WHAT'S DIFFERENT?
Events are the fuel
Columnar storage and query engine are the
superpower
They give the ability to slice and dice data and discover
unknown unknowns
UNKNOWN UNKNOWNS
Things we don't know we don't know
Example: Migrating to OpenTelemetry and from logs to
traces
INSPIRATION
Scuba: Diving into Data at Facebook
THE O.G
How to Avoid Paying for Honeycomb
HONEYCOMB UI
THE NEW KID
Very comprehensive observability package
Distributed tracing module needs ClickHouse
COROOT DISTRIBUTED TRACING UI
DEBUGGING WITH HONEYCOMB
Based on Sandbox on their website
HEATMAP
SELECTION
BUBBLEUP
BUBBLEUP - BASELINE VS SELECTION
BUBBLEUP - ENHANCE!
BUBBLEUP - POSSIBILITIES
BUBBLEUP - CHOSEN ONE
BUBBLEUP - PARAMETER VALUE
BUBBLEUP - FILTERED HEATMAP
BUBBLEUP - SELECTING TRACE
CULPRIT TRACE
CULPRIT TRACE - ZOOM
CULPRIT TRACE - SIMILAR SPANS
CULPRIT TRACE - DEPLOYMENT
THINGS TO REMEMBER
Observability = iterative & continuous process