stackconf 2024 | IGNITE: Practical AI with Machine Learning for Observability by Costa Tsaousis.pdf

NETWAYS | 20 slides | Jul 25, 2024

About This Presentation

Machine Learning for observability can be challenging, given the uniqueness of each workload. However, we can leverage ML to detect individual component anomalies, even if they are sometimes noisy/imprecise. At Netdata, we use ML models to analyze the behaviour of individual metrics. These models ad...


Slide Content

Practical AI
with Machine Learning
for Observability in Netdata
Costa Tsaousis

Page: 2
About Netdata
Netdata solves the granularity and cardinality limitations of observability.

Cardinality = the number of distinct time-series
Granularity = the frequency at which samples are collected

Low Cardinality + Low Granularity = lacking detail and coverage
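To make the two definitions concrete, here is a back-of-the-envelope sketch; the 2,000 time-series figure is an assumption roughly consistent with the "thousands of distinct time-series" mentioned later in the deck, not a number from this slide.

```python
# Illustrative only: how cardinality and granularity multiply into data volume.
cardinality = 2_000          # assumed distinct time-series on one node
granularity_seconds = 1      # per-second collection (one sample per second)

samples_per_day = cardinality * (24 * 3600 // granularity_seconds)
print(f"{samples_per_day:,} samples per node per day")   # 172,800,000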
Next: distributed

Page: 3
Netdata’s Design
Netdata uses a fully distributed, completely decentralized design.
Data are kept as close to the edge as possible.
[Diagram: agents (A1-A5, B1-B5, C1-C5) across Data Center 1, Data Center 2,
and Cloud Provider 1 stream to Netdata Parents (PA, PB, PC); Netdata Cloud
provides the dashboards and alerting on top.]
Next: monitoring in a box

Page: 4
Distributed Observability Pipeline
[Diagram: the distributed observability pipeline, covering both metrics and logs.]
Next: high fidelity

Page: 5
Netdata = High Fidelity
- Each node collects thousands of distinct time-series.
- All metrics are collected in high resolution (per-second).
- The database is stored on the local disk.

Then:
- Each node is a shard of a much larger collection.
- Each node's db may be replicated to multiple other nodes.

Multi-node views are accomplished by querying multiple nodes in parallel
and merging their responses on the fly.
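A minimal sketch of that query-and-merge idea, assuming a hypothetical per-node query helper; query_node() stands in for a real Netdata agent/parent API and fabricates data here so the sketch runs on its own.

```python
import random
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def query_node(node, metric, after, before):
    # Stand-in for a real per-node query API: returns {timestamp: value}.
    return {ts: random.random() for ts in range(after, before)}

def multi_node_view(nodes, metric, after, before):
    merged = defaultdict(list)
    # Query all nodes in parallel...
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = pool.map(lambda n: query_node(n, metric, after, before), nodes)
        # ...and merge their responses per timestamp on the fly.
        for result in results:
            for ts, value in result.items():
                merged[ts].append(value)
    # Aggregate across nodes (sum here; the dashboard chooses the aggregation).
    return {ts: sum(vals) for ts, vals in sorted(merged.items())}

view = multi_node_view(["node-a", "node-b", "node-c"], "system.cpu", 0, 60)
```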
Next: fully automated

Page: 6
Why is this important?
- Netdata auto-detects everything!
- Netdata automatically monitors everything!

- Since it monitors everything:
  - Fully automated dashboards are provided!
  - Fully automated alerts are also provided!

Because we all use the same infrastructure components!

- Machine Learning gets to a new level!
Next: all their ML ideas were bad

Page:
Wednesday, 2 October, 2019
Todd Underwood, Google
The vast majority of proposed production engineering uses
of Machine Learning (ML) will never work. They are
structurally unsuited to their intended purposes. There are
many key problem domains where SREs want to apply ML
but most of them do not have the right characteristics to be
feasible in the way that we hope.

After addressing the most common proposed uses of ML
for production engineering and explaining why they won't
work, several options will be considered, including
approaches to evaluating proposed applications of ML for
feasibility. ML cannot solve most of the problems
most people want it to, but it can solve some
problems. Probably.
Google:

All of Our ML
Ideas Are Bad

(and We Should Feel Bad)

7
AI for observability is tricky
:URL
Next: what can ML do?

Page: 8
What can ML do?
ML is the simplest way to model the behavior of individual metrics.
Given enough past values of a metric, ML can tell us if the value we
just collected is an outlier or not. We call this Anomaly Detection.
It is just a bit, true or false. Netdata stores it together with the
collected samples for every time-series.
Over a period of time, we calculate the Anomaly Rate, i.e. the % of
samples of a time-series found to be anomalous in the given window.

Using ML we can have a simple and effective way to learn the behavior
of our servers.
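A minimal sketch of the idea, not Netdata's actual model: an unsupervised model (here, k-means on short lag vectors) is trained on recent history, each new sample gets a single anomaly bit, and the Anomaly Rate is the percentage of bits set inside a window. All sizes and thresholds below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def lag_vectors(values, lags=6):
    # Turn a 1-D series into overlapping windows of `lags` consecutive samples.
    return np.array([values[i:i + lags] for i in range(len(values) - lags + 1)])

# Six hours of per-second samples as synthetic training history.
history = np.sin(np.linspace(0, 50, 6 * 3600)) + np.random.normal(0, 0.05, 6 * 3600)
train = lag_vectors(history)
model = KMeans(n_clusters=2, n_init=10).fit(train)

train_dist = np.min(model.transform(train), axis=1)
threshold = np.percentile(train_dist, 99)          # tolerate rare training outliers

def is_anomalous(last_samples):
    # One bit per collected sample: True if the latest window is an outlier.
    dist = np.min(model.transform(lag_vectors(last_samples)), axis=1)[-1]
    return bool(dist > threshold)

def anomaly_rate(bits):
    # Anomaly Rate over a window = % of samples flagged anomalous in it.
    return 100.0 * sum(bits) / len(bits)

print(is_anomalous(list(history[-12:]) + [5.0]))   # a 5.0 spike should flag True
```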
Next: when and how do we train ML models?

Page: 9
When and how do we train ML models?
For each time-series:
Netdata trains a new model every 3 hours, using the last 6 hours of data.
Netdata retains 18 models per time-series, covering the last 57 hours (~2.5 days).

This provides:
● Rolling management of the models
  Every 3 hours the oldest model is discarded and a new one is generated.
● Robust anomaly detection
  All 18 models have to agree to mark a collected sample as an anomaly.

ML is computationally intense, and we need a rolling model…
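A rolling-ensemble sketch using the schedule from this slide (retrain every 3 hours on the last 6 hours, keep 18 models, require consensus); train_model() and flags_anomaly() are trivial stand-ins, not Netdata's actual per-metric model.

```python
from collections import deque

TRAIN_EVERY  = 3 * 3600    # seconds between trainings
TRAIN_WINDOW = 6 * 3600    # seconds of history each model is trained on
MAX_MODELS   = 18          # models retained per time-series (~57h of behavior)

models = deque(maxlen=MAX_MODELS)   # appending past 18 drops the oldest model

def train_model(window):
    # Stand-in model: remember the range of values seen during training.
    return (min(window), max(window))

def flags_anomaly(model, sample):
    lo, hi = model
    return not (lo <= sample <= hi)

def maybe_retrain(history, now, last_trained):
    # Called on every collection tick; retrains at most once per TRAIN_EVERY.
    if now - last_trained >= TRAIN_EVERY:
        models.append(train_model(history[-TRAIN_WINDOW:]))
        return now
    return last_trained

def sample_is_anomalous(sample):
    # Consensus rule: every retained model must flag the sample.
    return bool(models) and all(flags_anomaly(m, sample) for m in models)
```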
Next: clusters across components

Page: 10
Machine Learning gets to a new level!
Since Netdata collects everything and trains ML models for all of it:
We noticed anomalies happen in clusters!
Multiple time-series become anomalous at the same time, in multiple
infrastructure components! CPU, Memory, Disk I/O may be affected, but
also Processes, Network, DNS, and many more!
So we defined:
Host Anomaly Rate: the % of the time-series of a server that are
anomalous at the same time!

Anomalies happen in clusters, affecting multiple components on each server!
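The definition in a few lines; `bits` is a hypothetical mapping of time-series name to the anomaly bit of the sample just collected on that host.

```python
def host_anomaly_rate(bits):
    # Host Anomaly Rate: % of the server's time-series anomalous right now.
    return 100.0 * sum(bits.values()) / len(bits)

# e.g. 30 anomalous out of 1,500 time-series -> 2.0% host anomaly rate
```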
Next: clusters across nodes

Page: 11
Machine Learning gets to a new level!
For interdependent servers:
Host Anomaly Rates across servers spike in clusters too!
Anomalies on one server trigger anomalies on another server, in a chain effect!
To explore all these efficiently, we needed a new type of query engine…

Anomalies happen in clusters, affecting multiple interconnected servers!
Next: scoring engine

Page: 12
Netdata’s Scoring Engine
Each Netdata agent is equipped with a scoring engine.
The scoring engine accepts a timeframe and returns an ordered list of
all time-series.
So, it scores all metrics according to an algorithm, for the given window.

Netdata uses the scoring engine for 3 features:
● Quickly spot what is anomalous
● Find correlations across all time-series
● Root cause analysis

Netdata comes with a unique scoring engine to analyze anomalies across all metrics.
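A sketch of the scoring idea, not the engine's actual algorithm: for a given window, score every time-series (here by its anomaly rate inside the window) and return them ordered, highest first. `window_bits` is a hypothetical mapping of time-series name to the anomaly bits collected inside the window.

```python
def score_window(window_bits):
    scores = {
        name: 100.0 * sum(bits) / len(bits)    # anomaly rate in the window
        for name, bits in window_bits.items() if bits
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The same ordered list is what powers the three features listed above: spotting what is anomalous, correlating time-series, and the root cause view on the following slides.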
Next: AR at the dashboard

Page: 13
Quickly spot what is anomalous
[Screenshot: dashboard with the time-frame picker, the Anomaly Rate button,
and the anomaly rate per section for the selected time-frame.]
Next: a Netdata chart

Page: 14
A Netdata Chart - controls
[Screenshot: chart controls, including the anomaly rate ribbon, the info ribbon,
NIDL controls to review data sources and slice/filter them
(NIDL = Nodes, Instances, Dimensions, Labels), aggregation across time,
aggregation across metrics, and dicing the data.]
Next: anomalies per node

Page: 15
A Netdata Chart - anomaly rate per node
[Screenshot: chart showing the anomaly rate per node.]
Next: anomalies per label

Page: 16
A Netdata Chart - anomaly rate per label
[Screenshot: chart showing the anomaly rate per label.]
Next: anomaly advisor

Page: 17
Anomaly Advisor - cross node anomalies
[Screenshot: Anomaly Advisor showing the Host Anomaly Rate percentage and
the number of metrics concurrently anomalous.]
Next: root cause analysis

Page: 18
Anomaly Advisor - root cause analysis results
The Anomaly Advisor presents a sorted list of all metrics, ordered by
their anomaly rate during the highlighted time-frame.
Next: why Netdata

Page: 19
Why Netdata
● High fidelity monitoring
  All metrics are collected per second.
  Thousands versus hundreds of metrics.
● Powerful visualization
  Fully automated infrastructure-level dashboards, visualizing all metrics
  collected, offering slice and dice capabilities on any dataset without a
  query language.
● Machine Learning
  Unsupervised anomaly detection on every metric and a scoring engine to
  explore anomalies.
● Out of the box alerts
  Predefined alerts use rolling windows and statistical functions to detect
  common issues, without fixed thresholds.

From Zero To Hero, Today!
Next: thank you

Page: 20
Thank you
Monitor your servers, containers, and applications, in high resolution,
in real time, with true machine learning for all metrics!
netdata.cloud