stackconf 2024 | IGNITE: Practical AI with Machine Learning for Observability by Costa Tsaousis
NETWAYS
About This Presentation
Machine Learning for observability can be challenging, given the uniqueness of each workload. However, we can leverage ML to detect individual component anomalies, even if they are sometimes noisy/imprecise. At Netdata, we use ML models to analyze the behaviour of individual metrics. These models adapt to the specific characteristics of each metric, ensuring anomalies can be detected accurately, even in unique workloads. The power of ML becomes evident when these seemingly noisy anomalies converge across various services, serving as indicators of something exceedingly unusual. ML is an advisor, training numerous independent models for each individually collected metric to achieve anomaly detection based on recent behaviour. When multiple independent metrics exhibit anomalies simultaneously, it is usually a signal that something unusual is occurring. This approach to ML can be instrumental in uncovering malicious attacks and, in many cases, predicting combined failures across seemingly unrelated components.
Slide Content
Page: 1
Practical AI with Machine Learning for Observability in Netdata
Costa Tsaousis
Page: 2
About Netdata
Netdata solves the granularity and cardinality limitations of observability.
Cardinality = the number of distinct time-series
Granularity = the frequency at which samples are collected
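To make the two terms concrete, here is a hypothetical illustration (plain Python, not Netdata code); the metric names, labels, and values are invented for the example.

```python
# Hypothetical illustration: cardinality counts distinct label combinations,
# granularity is the spacing between consecutive samples of one series.
samples = [
    # (metric, labels, timestamp, value)
    ("disk.io", {"device": "sda", "op": "read"},  1000, 12.0),
    ("disk.io", {"device": "sda", "op": "write"}, 1000,  3.0),
    ("disk.io", {"device": "sdb", "op": "read"},  1000,  7.5),
    ("disk.io", {"device": "sda", "op": "read"},  1001, 11.2),
]

# Cardinality: number of distinct time-series (metric name + label set).
series = {(metric, tuple(sorted(labels.items()))) for metric, labels, _, _ in samples}
print("cardinality:", len(series))  # 3 distinct time-series

# Granularity: interval between consecutive samples of a single series (1 second here).
timestamps = sorted(ts for metric, labels, ts, _ in samples
                    if labels == {"device": "sda", "op": "read"})
print("granularity:", timestamps[1] - timestamps[0], "second(s)")
```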
Page: 3
Netdata’s Design
Netdata uses a fully distributed, completely decentralized design. Data are kept as close to the edge as possible.
[Diagram: Netdata agents (A1-A5, B1-B5, C1-C5) in Data Center 1, Data Center 2, and Cloud Provider 1 stream to Netdata Parents (PA, PB, PC); Netdata Cloud provides dashboards and alerting on top.]
Next: monitoring in a box
Page: 4
Distributed Observability Pipeline
[Diagram: distributed observability pipeline for metrics and logs.]
Next: high fidelity
Page: 5
Netdata = High Fidelity
- Each node collects thousands of distinct time-series.
- All metrics are collected in high resolution (per-second).
- The database is stored on the local disk.
Then:
- Each node is a shard of a much larger collection.
- Each node's database may be replicated to multiple other nodes.
Multi-node views are accomplished by querying multiple nodes in parallel and merging their responses on the fly.
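As a rough sketch of that fan-out-and-merge idea (not Netdata's actual query engine), the snippet below asks several nodes for the same time-series in parallel and merges the points as responses arrive. The node URLs, the response format, and the `query_node` helper are assumptions for illustration.

```python
# Hypothetical sketch of fan-out querying and on-the-fly merging.
import concurrent.futures
import json
import urllib.request

NODES = ["http://node-a:19999", "http://node-b:19999"]  # hypothetical endpoints

def query_node(base_url: str, context: str, after: int, before: int) -> dict:
    """Ask one node for a time-series; returns {timestamp: value}."""
    url = f"{base_url}/api/v1/data?context={context}&after={after}&before={before}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        payload = json.load(resp)
    # Assume rows of [timestamp, value]; real responses differ per API version.
    return {row[0]: row[1] for row in payload.get("data", [])}

def multi_node_view(context: str, after: int, before: int) -> dict:
    """Query all nodes in parallel and merge per-timestamp points as they arrive."""
    merged: dict[int, float] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        futures = [pool.submit(query_node, n, context, after, before) for n in NODES]
        for fut in concurrent.futures.as_completed(futures):
            for ts, value in fut.result().items():
                merged[ts] = merged.get(ts, 0.0) + value  # sum across nodes
    return dict(sorted(merged.items()))
```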
Next: fully automated
Page: 6
Why is this important?
- Netdata auto-detects everything!
- Netdata automatically monitors everything!
- Since it monitors everything:
  - Fully automated dashboards are provided!
  - Fully automated alerts are also provided!
(Because we all use the same infrastructure components!)
- Machine Learning gets to a new level!
Next: all their ML ideas were bad
Page: 7
AI for observability is tricky
Google: "All of Our ML Ideas Are Bad (and We Should Feel Bad)"
Todd Underwood, Google
Wednesday, 2 October, 2019
"The vast majority of proposed production engineering uses of Machine Learning (ML) will never work. They are structurally unsuited to their intended purposes. There are many key problem domains where SREs want to apply ML, but most of them do not have the right characteristics to be feasible in the way that we hope.
After addressing the most common proposed uses of ML for production engineering and explaining why they won't work, several options will be considered, including approaches to evaluating proposed applications of ML for feasibility. ML cannot solve most of the problems most people want it to, but it can solve some problems. Probably."
Next: what can ML do?
Page: 8
What can ML do?
Using ML, we can have a simple and effective way to learn the behavior of our servers.
ML is the simplest way to model the behavior of individual metrics. Given enough past values of a metric, ML can tell us whether the value we just collected is an outlier. We call this Anomaly Detection.
It is just a bit, true or false. Netdata stores it together with the collected samples for every time-series.
Over a period of time, we calculate the Anomaly Rate, i.e. the percentage of samples of a time-series found to be anomalous in the given window.
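A toy illustration of the anomaly bit and the Anomaly Rate (plain Python, not Netdata's implementation): each collected sample gets a true/false flag, and the anomaly rate over a window is the share of flagged samples. The 3-sigma rule here is only a stand-in for whatever the real models do.

```python
# Toy illustration of the "anomaly bit" and the Anomaly Rate.
from statistics import mean, stdev

def is_outlier(history: list[float], value: float) -> bool:
    """Stand-in model: flag values more than 3 sigma away from recent behavior."""
    if len(history) < 30:
        return False                      # not enough past values yet
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(value - mu) > 3 * sigma

def anomaly_rate(anomaly_bits: list[bool]) -> float:
    """Percentage of samples in the window flagged as anomalous."""
    return 100.0 * sum(anomaly_bits) / len(anomaly_bits) if anomaly_bits else 0.0

history: list[float] = []
bits: list[bool] = []
for value in [1.0, 1.2] * 30 + [9.0, 1.1, 1.0]:  # one spike among normal samples
    bits.append(is_outlier(history, value))       # the anomaly bit, stored with the sample
    history.append(value)

print(f"anomaly rate over the window: {anomaly_rate(bits):.2f}%")
```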
Next: when and how do we train ML?
Page: 9
When and how do we train ML models?
ML is computationally intense, and we need a rolling model…
For each time-series:
Netdata trains a new model every 3 hours, using the last 6 hours of data.
Netdata retains 18 models per time-series, covering the last 57 hours (~2.5 days).
This provides:
● Rolling management of the models: every 3 hours the oldest model is discarded and a new one is generated.
● Robust anomaly detection: all 18 models have to agree to mark a collected sample as an anomaly.
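A simplified sketch of the rolling pool of models and the consensus rule (illustrative Python, not the agent's own code; `TinyModel` and its 3-sigma rule are stand-ins for the real unsupervised per-metric models):

```python
# Simplified sketch of rolling model management and consensus anomaly detection.
from collections import deque
from statistics import mean, stdev

class TinyModel:
    """Hypothetical per-metric model: learns a range of 'normal' values."""
    def __init__(self, training_values: list[float]):
        self.mu = mean(training_values)
        self.sigma = stdev(training_values) if len(training_values) > 1 else 0.0

    def is_anomalous(self, value: float) -> bool:
        return abs(value - self.mu) > 3 * max(self.sigma, 1e-9)

MODELS_PER_SERIES = 18                                        # models kept per time-series
models: deque[TinyModel] = deque(maxlen=MODELS_PER_SERIES)    # oldest model drops off

def retrain(last_6h_values: list[float]) -> None:
    """Called every 3 hours: add a fresh model, implicitly discarding the oldest."""
    models.append(TinyModel(last_6h_values))

def anomaly_bit(value: float) -> bool:
    """A sample is anomalous only if ALL retained models agree."""
    return bool(models) and all(m.is_anomalous(value) for m in models)
```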
Next: clusters across components
Page: 10
Machine Learning gets to a new level!
Anomalies happen in clusters, affecting multiple components on each server!
Since Netdata collects everything and trains ML models for all of it, we noticed that anomalies happen in clusters!
Multiple time-series become anomalous at the same time, in multiple infrastructure components! CPU, Memory, and Disk I/O may be affected, but also Processes, Network, DNS, and many more!
So we defined the Host Anomaly Rate: the percentage of a server's time-series that are anomalous at the same time!
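A toy calculation of the Host Anomaly Rate (the metric names and anomaly bits below are made up): it is simply the share of the server's time-series whose anomaly bit is set at the same instant.

```python
# Toy illustration of the Host Anomaly Rate.
anomaly_bits_now = {   # hypothetical anomaly bits for one server, at one moment
    "system.cpu":     True,
    "system.ram":     False,
    "disk.io.sda":    True,
    "net.eth0":       True,
    "dns.queries":    False,
    "apps.cpu.nginx": True,
}

def host_anomaly_rate(bits: dict[str, bool]) -> float:
    """Percentage of the server's time-series that are anomalous right now."""
    return 100.0 * sum(bits.values()) / len(bits) if bits else 0.0

print(f"host anomaly rate: {host_anomaly_rate(anomaly_bits_now):.1f}%")  # 66.7%
```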
Next: clusters across nodes
Page: 11
Machine Learning gets to a new level!
Anomalies happen in clusters, affecting multiple interconnected servers!
For interdependent servers, Host Anomaly Rates across servers spike in clusters too!
Anomalies on one server trigger anomalies on another server, in a chain effect!
To explore all this efficiently, we needed a new type of query engine…
Next: scoring engine
Page: 12
Netdata’s Scoring Engine
Netdata comes with a unique scoring engine to analyze anomalies across all metrics.
Each Netdata agent is equipped with a scoring engine. The scoring engine accepts a timeframe and returns an ordered list of all time-series, scoring every metric according to an algorithm for the given window.
Netdata uses the scoring engine for 3 features:
● Quickly spot what is anomalous
● Find correlations across all time-series
● Root cause analysis
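A minimal sketch of what such a scoring pass can look like (illustrative Python; the per-series anomaly bits are hypothetical, and ranking by anomaly rate is just one possible scoring algorithm):

```python
# Minimal sketch of a scoring pass: score every time-series over a window
# by its anomaly rate and return them in descending order.
def anomaly_rate(bits: list[bool]) -> float:
    return 100.0 * sum(bits) / len(bits) if bits else 0.0

window_bits = {  # hypothetical anomaly bits collected during the requested timeframe
    "system.cpu":    [False] * 50 + [True] * 10,
    "disk.io.sda":   [False] * 58 + [True] * 2,
    "net.eth0":      [False] * 60,
    "mysql.queries": [False] * 30 + [True] * 30,
}

def score_window(bits_per_series: dict[str, list[bool]]) -> list[tuple[str, float]]:
    """Return all time-series ordered by their score for the given window."""
    scored = [(name, anomaly_rate(bits)) for name, bits in bits_per_series.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

for name, score in score_window(window_bits):
    print(f"{score:6.2f}%  {name}")   # mysql.queries first, net.eth0 last
```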
Next: AR at the dashboard
Page: 13
Quickly spot what is anomalous
[Screenshot: dashboard with the time-frame picker, the Anomaly Rate button, and the anomaly rate per section for the selected time-frame.]
Next: a Netdata chart
Page: 14
A Netdata Chart - controls
[Screenshot: chart controls, including the anomaly rate ribbon, the info ribbon, NIDL controls to review data sources and slice/filter them (NIDL = Nodes, Instances, Dimensions, Labels), aggregation across time, aggregation across metrics, and dicing the data.]
Next: anomalies per node
Page: 15
A Netdata Chart - anomaly rate per node
[Screenshot: the anomaly rate per node.]
Next: anomalies per label
Page: 16
A Netdata Chart - anomaly rate per label
[Screenshot: the anomaly rate per label.]
Next: anomaly advisor
Page: 17
Anomaly Advisor - cross node anomalies
[Screenshot: the Anomaly Advisor showing the Host Anomaly Rate percentage and the number of metrics that are concurrently anomalous.]
Next: root cause analysis
Page: 18
Anomaly Advisor - root cause analysis results
The Anomaly Advisor presents a sorted list of all metrics, ordered by their anomaly rate during the highlighted time-frame.
Next: why Netdata
Page: 19
Why Netdata
From Zero To Hero, Today!
● High fidelity monitoring: all metrics are collected per second; thousands versus hundreds of metrics.
● Powerful visualization: fully automated infrastructure-level dashboards, visualizing all metrics collected, offering slice-and-dice capabilities on any dataset without a query language.
● Machine Learning: unsupervised anomaly detection on every metric and a scoring engine to explore anomalies.
● Out of the box alerts: predefined alerts use rolling windows and statistical functions to detect common issues, without fixed thresholds (a sketch of the idea follows below).
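A minimal sketch of the rolling-window idea behind such alerts (illustrative Python, not Netdata's health configuration syntax; the window length and z-score thresholds are arbitrary): instead of a fixed threshold, the latest value is compared against statistics computed over a recent window.

```python
# Illustrative only: an alert that uses a rolling window and a statistical
# function instead of a fixed threshold.
from collections import deque
from statistics import mean, stdev

class RollingAlert:
    def __init__(self, window: int = 600, warn_z: float = 3.0, crit_z: float = 5.0):
        self.values: deque[float] = deque(maxlen=window)  # e.g. last 10 minutes at 1s
        self.warn_z, self.crit_z = warn_z, crit_z

    def evaluate(self, value: float) -> str:
        """Compare the new sample against the rolling statistics, then record it."""
        status = "CLEAR"
        if len(self.values) >= 60:                         # need some history first
            mu, sigma = mean(self.values), stdev(self.values)
            z = abs(value - mu) / sigma if sigma > 0 else 0.0
            if z >= self.crit_z:
                status = "CRITICAL"
            elif z >= self.warn_z:
                status = "WARNING"
        self.values.append(value)
        return status
```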
Next: thank you
Page: 20
Thank you!
Monitor your servers, containers, and applications, in high resolution, in real time, with true machine learning for all metrics!
netdata.cloud