stackconf 2024 | Netdata: Open Source, Distributed Observability Pipeline – Journey and Challenges by Costa Tsaousis
NETWAYS
About This Presentation
Netdata is a powerful open-source, distributed observability pipeline designed to provide higher fidelity, easier scalability, and a lower cost of ownership compared to traditional monitoring solutions. This presentation will offer an in-depth overview of the journey we’ve undertaken in building Netdata, highlighting the challenges we’ve faced and the innovative solutions we’ve developed to address them. In this presentation, we will delve into the history of Netdata, starting from its inception. We’ll discuss the initial goals of creating a monitoring tool that could offer high-resolution metrics, auto-detection of metrics, and real-time visualization, all with minimal configuration required. We’ll also explore how Netdata garnered rapid attention and support from the open-source community.
Size: 3.23 MB
Language: en
Added: Jul 02, 2024
Slides: 65 pages
Slide Content
Netdata
the open-source observability
platform everyone needs
Costa Tsaousis
Page:
2
About Netdata
●Born out of a need
While migrating a large infra from on-prem to cloud, we were facing unexplainable issues -
existing monitoring solutions (open-source and commercial) failed to diagnose, or even
surface.
●Born out of curiosity
Experimented to understand if and how a monitoring solution could be real-time, high-resolution, easier, simpler and work out of the box.
●Born on GitHub, open-source from the very beginning
The love and adoption by the community gave substance
and a future to Netdata.
GitHub URL: https://github.com/netdata/netdata
Page:
3
Why another tool?
●Low Fidelity Insights
The design of most tools forces users to lower granularity and cardinality, resulting in abstract views
that are not useful in revealing the true dynamics of the infrastructure.
●Inefficient
They enforce a development lifecycle in setting up and maintaining observability, requiring advanced
skills, a lot of discipline and huge amounts of time and effort.
●No AI and Machine Learning
Most observability solutions are pretty dumb. And even when they say they are smarter, they usually use
tricks, not true machine learning.
●Expensive
Observability is frequently more expensive than the cost of the infrastructure that is being
monitored.
Page:
4
The Current State of Observability
●Too little observability
Traditional check-based systems, like Nagios, Icinga, Zabbix, Sensu, PRTG, CheckMk, SolarWinds,
etc. have evolved around the idea of checks or sensors. Each check has a status, probably annotated with
some text and a few metrics. Checks are executed every minute. Equivalent to “traffic lights” monitoring.
Reliable and robust, but not enough for today's needs, suffering from extensive “blind spots”.
●Too complex observability
DIY platforms, like the Grafana ecosystem, have evolved around a much better design (metrics, logs,
traces) and although they are more powerful and highly customizable, they introduce way too many
moving parts, require significant skills and expertise, have a steep learning curve, and they quickly become
overcomplicated to maintain and scale.
●Too expensive observability
Commercial vendors, like Datadog, Dynatrace, New Relic and Instana, each with their own strengths and
weaknesses, are sometimes easier and better (to a different extent each), but they are
unrealistically expensive.
Page:
5
Observability Generations
●1st generation: Checks - "Check it is there and works" - Nagios, Icinga, Zabbix, Sensu, CheckMk, PRTG, SolarWinds
●2nd generation: Metrics - "Sample it periodically" - Graphite, OpenTSDB, Prometheus, InfluxDB
●3rd generation: Logs - "Collect and index its logs" - ELK, Splunk
●4th generation: Integrated - "All in One" - Datadog, Dynatrace, New Relic, Instana, Grafana
●5th generation: Distributed - "All in One, Real-Time, High-Fidelity, AI-Powered, Cost-Efficient" - Netdata
Blog post for more details:
Attributes of the ideal observability solution…
Page:
7
Goal: Monitor Everything!
Bad “Best Practice”:
Monitor only what you need and understand.
Why this is a myth:
1. Holistic View
To understand the dynamics of modern infrastructure we need comprehensive and holistic views of the technologies
involved. Cherry-picking data usually means blind spots.
2. Easier and Faster Root Cause Analysis
When issues arise, having access to a broad set of data helps in pinpointing the source of the problem.
3. Proactive Issue Detection
With a full spectrum of monitoring, patterns and anomalies can be identified early. Such issues may not be evident when
only a subset of the data is considered.
4. Adaptability to Changing Environments
Modern IT environments are dynamic and constantly evolving. What is considered important today may change
tomorrow due to shifts in application architecture, user behavior, or infrastructure changes.
Page:
8
Goal: Real-Time (per-second) & Low-Latency!
Bad “Best Practice”:
Monitoring every 10, 15 or 60 seconds.
Why this is a myth:
1. True Performance Monitoring
Virtualized environments are highly dynamic, non-linear, and unpredictable. Monitoring per second reveals the true
performance characteristics of the infrastructure and applications by capturing the fine-grained details of their behavior.
2. Improved Correlation and Context
Having data points every second enhances the ability to correlate events and metrics across different parts of the system.
This improved correlation helps in understanding the context of issues, such as determining the sequence of events
leading up to a failure or performance degradation.
3. Faster Response to Issues
The effects of your corrective actions, or even configuration changes, are immediately reflected on the monitoring data,
enabling quicker identification and mitigation of problems.
Page:
9
Goal: Powerful Fully Automated Dynamic Dashboards!
Bad “Best Practice”:
Design all the custom dashboards you need, beforehand.
Why this is a myth:
1. Infinite Correlations
The correlations between data vary depending on the issue at hand. Dashboards with fixed correlations may feel
reassuring, but they are not that helpful during a crisis.
2. Reduced Development Lifecycle Overhead
When issues arise, designing a new custom dashboard means time, effort and skills. Instead of focusing on understanding
and solving the problem, you focus on configuring the monitoring system.
3. Clarity over Versioning and Adaptation Chaos
Multiple versions of similar dashboards increase noise and confusion, especially at the time you need clarity.
4. Seamless Exploration and Flexibility
Fully automated dynamic dashboards enable users to easily slice, dice, and correlate data in any way they see fit without
needing a deep understanding of the data or the queries involved.
Page:
10
Goal: AI-powered Correlations!
Bad “Best Practice”:
Machine learning (ML) cannot help in observability.
Why this is a myth:
1. ML Can Learn the Behavior of Metrics
ML models can establish baselines and detect deviations from these patterns.
2. ML Can Be Trained at the Edge (and be very fast)
Edge training of ML models enables localized learning and decision-making.
3. ML Can Reliably Detect Outliers
ML excels at outlier detection by identifying data points that significantly deviate from the established norm.
4. ML Can Reveal Hidden Correlations
By identifying seemingly unrelated metrics that become anomalous together.
5. Density of Outliers and Concurrent Anomalous Time-Series Provide Key Insights
The density of outliers in a time-series and the number of concurrently anomalous time-series are critical indicators of
underlying issues.
Netdata
Distributed Design,
For High-Fidelity monitoring!
Page:
12
The design of Observability keeps fidelity low
●What affects fidelity?
Granularity (the resolution of the data) and cardinality (the number of entities monitored) control the
amount of data a monitoring system must ingest, process, store and retain.
Low granularity = blurry data, not that detailed
Low cardinality = blind spots, not everything is monitored
Low granularity + low cardinality = abstract view of the infrastructure, lacking detail and coverage
●Why not have high fidelity?
Centralization is the key reason for keeping granularity and cardinality low.
Example: a system monitoring 3000 metrics every second (3k samples/s) has to process, store and query
450x more data (3,000 ÷ ~6.7 ≈ 450) compared to a system monitoring 100 metrics every 15 seconds
(<7 samples/s).
Centralization makes fidelity and cost proportional to each other; increasing fidelity results in higher costs,
and reducing costs leads to decreased fidelity.
Page:
13
Decentralized Design For High Fidelity
●Keep data at the edge
By keeping the data at the edge:
○Compute & storage resources are already available, with capacity to spare
○No need for network resources
○The work to be done is small and it can be optimized,
so that monitoring is a “polite citizen” to production applications
●Multiple independent centralization points
Mini centralization points may exist, as required for operational needs:
○Ephemeral nodes, that may vanish at any point in time
○High availability of observability data
○Offloading “sensitive” production systems from observability work
●Unify and integrate everything at query time
To provide unified infrastructure-wide views, query edge systems (or the mini centralization points), aggregate their
responses and provide high-resolution, real-time dashboards and alerts.
Page:
14
Common Concerns about Decentralized Designs
●The agent will be heavy
No! The Netdata agent, which is a complete monitoring pipeline in a box, processing many thousands of metrics per second (vs.
others that process just a few hundred every minute), is one of the lightest observability agents available.
●Queries will influence production systems
No! Each agent serves only its own data. Querying such a small dataset is lightweight and does not influence operations. For
very sensitive or weak production systems, a mini-centralization point next to these systems will isolate them from queries
(and also offload them from ingestion, processing, storage and retention).
●Queries will be slower
No! They are actually faster! Distributing tiny queries in parallel to multiple systems provides an aggregate compute power that
is many times higher than what any single system can provide.
●Will require more bandwidth
No! Querying is selective, most of the observability data are never queried unless required for exploration or troubleshooting.
And even then, just a small portion of the data is examined.
So, the overall bandwidth used is a tiny fraction compared to centralized systems.
Page:
●150+ dashboard charts, 2k+ unique time-series
CPU, Memory, Disks, Mount Points, Filesystems, Network Interfaces, the whole Networking Stack for all
Protocols, Firewall, systemd Units, Containers, Processes, Users, User Groups and more.
All collected and visualized with 1-second granularity.
●50+ unique alerts
Checking the health of every single component the system has, for common errors, misconfigurations, and
error conditions.
●systemd-journal logs explorer
Visualize and analyze system and application logs by directly querying systemd journals, without the need for a
logs database server.
●Network explorer
Visually explore all TCP and UDP network connections, from all processes and containers.
●Unsupervised anomaly detection
18 Machine Learning models are trained for each time-series collected, modeling the behavior of each metric
over the last few days, offering outlier detection in real-time, during data collection.
●2 years of retention using 3GiB of disk space
2 weeks of high-res (per-sec), 3 months of mid-res (per-min), 2 years of low-res (per-hour).
●1% of a single CPU core, 120MB RAM, almost zero disk I/O
Optimized for speed and efficiency, a nice companion for production systems and applications.
What you get by just installing Netdata on an empty VM
15
Netdata on an empty VM
Page:
16
Distributed Metrics Pipeline
The Netdata
Metrics Pipeline
is like lego
building blocks
High-resolution tier at ~0.5 bytes per sample on disk.
Multiple tiers provide efficient storage for years of retention.
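To get a feel for these numbers, here is a back-of-envelope estimate (a sketch only; the ~0.5 bytes/sample figure comes from the slide above, while the bytes/sample assumed for the aggregated tiers is a guess for illustration):

```python
# Rough storage estimate for Netdata's tiered retention on an "empty VM"
# (2k+ time-series, per the earlier slide). Assumptions are illustrative only:
# ~0.5 bytes/sample for the high-resolution tier (from the slide),
# ~4 bytes/sample assumed for the aggregated tiers.

def tier_bytes(metrics: int, interval_s: int, retention_days: float, bytes_per_sample: float) -> float:
    samples = metrics * (86_400 / interval_s) * retention_days
    return samples * bytes_per_sample

metrics = 2_000
total = (
    tier_bytes(metrics, 1, 14, 0.5)        # per-second tier, 2 weeks
    + tier_bytes(metrics, 60, 90, 4.0)     # per-minute tier, 3 months (assumed bytes/sample)
    + tier_bytes(metrics, 3_600, 730, 4.0) # per-hour tier, 2 years (assumed bytes/sample)
)
print(f"~{total / 2**30:.1f} GiB")  # lands in the same ballpark as the ~3 GiB retention figure
```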
Page:
17
The Netdata way - standalone
(Diagram: servers S1-S5, each running a Netdata agent that locally provides dashboards, metrics & logs, and alerting.)
Install the agent on all your systems.
Page:
18
The Netdata way - distributed monitoring
(Diagram: servers S1-S5 on Cloud Provider 1, all connected to Netdata Cloud (SaaS), which provides dashboards and alerting.)
Install Netdata on all your systems,
use Netdata Cloud to access your infra.
Netdata Cloud (NC) acts like a smart proxy.
Your data is always inside your servers.
NC queries your servers to show dashboards.
Page:
19
The Netdata way - centralization points
(Diagram: ephemeral or sensitive servers S1-S5 stream to S6, a Netdata Parent acting as a centralization point that provides dashboards and alerting.)
Netdata Parent is the same software as the Netdata Agent.
●Multiple independent centralization points
Users may create any number of independent observability
centralization points within an infrastructure, as required by their
operational needs:
○Ephemeral Nodes
Maintain observability data of ephemeral nodes that may
stop being available due to auto-scaling.
○High Availability
Replicate observability data.
○Production Systems Isolation
For offloading sensitive production systems from
observability work.
○Security
For protecting production systems, Netdata parents can
act as a session border controller when placed in a DMZ.
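As a concrete illustration of how a node streams to a parent, a minimal sketch of Netdata's stream.conf is shown below (hostname and API key are placeholders; consult the streaming documentation for the full set of options):

```
# stream.conf on each child node
[stream]
    enabled = yes
    destination = parent.example.com:19999
    api key = 11111111-2222-3333-4444-555555555555

# stream.conf on the parent: accept children that present this API key
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```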
Page:
20
The Netdata way - hybrid cloud
(Diagram: servers A1-A5, B1-B5 and C1-C5 in Data Center 1, Data Center 2 and Cloud Provider 1 stream to Netdata Parents PA, PB and PC respectively; Netdata Cloud provides unified dashboards and alerting on top.)
Page:
●-35% CPU Utilization
Netdata: 1.8 CPU cores per million of metrics/s
Prometheus: 2.9 CPU cores per million of metrics/s
●-49% Peak Memory Consumption
Netdata: 49 GiB
Prometheus: 89 GiB
●-12% Bandwidth
Netdata: 227 Mbps
Prometheus: 257 Mbps
●-98% Disk I/O
Netdata: 3 MiB/s (no reads, 3 MiB/s writes)
Prometheus: 129 MiB/s (73.6 MiB/s reads, 55.2 MiB/s writes)
●-75% Storage Footprint
Netdata: 10 days per-sec, 43 days per-min, 467 days per-hour
Prometheus: 7 days per-sec
Stress tested Netdata parent and Prometheus with 500 servers, 40k containers, at 2.7 million metrics/s
21
Netdata vs Prometheus
:full comparison URL
Page:
In December 2023, the University of Amsterdam published a study related
to the impact of monitoring tools for Docker-based systems, aiming to
answer 2 questions:
●What is the impact of monitoring tools on the energy efficiency of
Docker-based systems?
●What is the impact of monitoring tools on the performance of
Docker-based systems?
They found that:
-Netdata is the most efficient agent,
requiring significantly fewer system resources than the others.
-Netdata is excellent in terms of performance impact,
allowing containers and applications to run without any measurable
impact due to observability.
Outperforming other monitoring solutions in edge resource efficiency!
22
Netdata is the most lightweight platform!
:full comparison URL
Page:
●Netdata
68.7k stars
The open-source observability platform everyone needs!
●Elasticsearch
68.3k stars
Free and Open, Distributed, RESTful Search Engine.
●Grafana
61.1k stars
The open and composable observability and data visualization platform. Visualize metrics,
logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB,
Postgres and many more.
●Prometheus
53.4k stars
The Prometheus monitoring system and time series database.
●Jaeger
19.7k stars
CNCF Jaeger, a Distributed Tracing Platform.
Netdata is
leading the
observability
category in the
CNCF landscape,
in terms of
users’ love.
23
Netdata in CNCF
Disclaimer: Netdata is a member of, and supports, CNCF, but the project is neither endorsed by nor incubating in CNCF (because we don’t want to).
This list indicates only users’ love, expressed as GitHub stars.
Netdata
Challenge 1:
from zero to hero, today!
Page:
●Similar physical or virtual hardware
We all use a finite set of physical and virtual hardware. This
hardware may be different in terms of performance and capacity,
but the technologies involved are standardized.
●Similar operating systems
We all use flavors of a small set of operating systems, exposing a
finite set of metrics covering the monitoring of all system
components and operating system layers.
●Packaged applications
Most of our infrastructure is based on packaged applications, like
web servers, database servers, message brokers, caching servers,
etc.
●Standard libraries
Even for our custom applications, we usually rely on packaged
libraries that expose telemetry in a finite and predictable way.
Since we have so much in common, why does it take so long to set up a monitoring solution?
25
We have a lot in common
Page:
●Nodes
The node the data are coming from.
●Contexts
The kind of metrics, like “disk.io”, “cgroup.cpu”, “nginx.requests”.
Equivalent to the metric name in Prometheus.
●Instances
The unique instances of the things being monitored. For example
“/dev/sda” and “/dev/sdb” are two instances (disks) for which we
monitor “disk.io” (context).
Equivalent to a combination of some of the labels of a metric in Prometheus.
●Dimensions
The attributes of the instances monitored. For example “read” and
“write” are Dimensions of the “disk.io” context, of the “/dev/sda”
instance (disk). Dimensions get values and maintain a time-series.
Equivalent to a unique time-series in Prometheus.
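To make the mapping concrete, a small illustrative sketch follows (the Prometheus series and label names are hypothetical, chosen only to mirror the disk example above):

```python
# Hypothetical Prometheus-style series (illustrative only):
#   node_disk_read_bytes_total{instance="server1", device="sda"}
#
# The same data point, identified in Netdata's NIDL terms:
sample = {
    "node": "server1",       # Node      - where the data comes from
    "context": "disk.io",    # Context   - the kind of metric (roughly the Prometheus metric name)
    "instance": "/dev/sda",  # Instance  - the monitored thing (roughly a subset of Prometheus labels)
    "dimension": "read",     # Dimension - the attribute whose values form the time-series
    "value": 1024,           # the collected sample
}
```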
NIDL stands for:
-Nodes
-Instances
-Dimensions
-Labels
The name comes from the slicing
and dicing controls on all Netdata
charts.
26
NIDL, the model for rapid deployment
Page:
27
NIDL: how it looks?
Context
(disk.io)
Every chart on the dashboard is a context,
aggregating all the instances from all nodes
selected, for the visible timeframe.
Dashboard configuration is done per context.
Example: apply units to all dimensions, set default dicing
settings, provide additional information to help users
understand what they see, and more.
(Diagram: the disk.io context aggregates instances /dev/sda, /dev/sdb and /dev/sdc, each with read and write dimensions.)
Alerts are configured for contexts, but they are
applied to instances.
Example: “apply this disk.io alert to all disks” or even
“apply this disk.io alert to all NVME disks”.
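A minimal sketch of what such a health template could look like is shown below (the alert name, lookup window and threshold are hypothetical; the dimension names follow the example above, and the shipped health configuration files are the authoritative reference):

```
# hypothetical template: applied automatically to every instance (disk) of the disk.io context
 template: disk_io_high_utilization
       on: disk.io
   lookup: average -1m unaligned of read,write
    units: KiB/s
    every: 10s
     warn: $this > 100000
     info: illustrative only - average disk I/O over the last minute is unusually high
```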
Instances have labels, and alert variables
related to the component they refer to (for
example, disks have a model, a serial number, a
kind, etc).
Dimensions have values, time-series data.
Variables lookup is smart to match the component the alert is linked to.
Page:
●Fully automated visualization
Netdata visualizes all metrics in a meaningful way, by correlating the
right dimensions together and applying the right settings for each
case.
●Fully automated alerts
Netdata comes with preconfigured alerts for 350 unique
components and applications. All these alert templates are applied
automatically to the right components, applications and charts.
●Slice and dice easily from the UI
The NIDL framework enables easy slicing and dicing of the data
from the UI, without the need to learn a query language.
Netdata
incorporates all
the knowledge
and skills
required to set
up a monitoring
system
29
NIDL: the result
Page:
●Just install it
One moving part: Netdata. Batteries included!
(i.e. data collection plugins and all needed modules are shipped with Netdata).
●Auto-discovers all metrics
Data collection does not require configuration, unless the monitored data are
password protected (Netdata needs the password).
Data collection plugins provide metrics with the NIDL framework embedded into
them.
●Fully automated visualization
Dashboards are available 1 second after Netdata starts.
●Pre-configured alerts
If something is wrong, an alert will fire up just a few seconds after installation.
●High Fidelity
High granularity: per second data collection and visualization as a standard.
High cardinality: the more metrics the better the troubleshooting gets.
Designed to be installed mid-crisis!
30
Mission accomplished!
Netdata
Challenge 2:
get rid of the query language
for slicing and dicing data
Page:
●Since users haven’t configured the metrics
themselves, can we provide a UI that can explain
what users see?
●How will users be able to slice and dice the data on
any chart, in the way that makes sense for them?
Netdata
collects a vast
number of
metrics you will
probably see for
the first time
32
Slice and dice from the UI
Page:
33
A Netdata Chart
Netdata Cloud Live Demo URL::Netdata Parent URL
Page:
34
Info Button: Help about this chart
The info button includes links to relevant documentation
and/or a helpful message about the metrics on each
chart.
Page:
35
A Netdata Chart - controls
Anomaly rate ribbon
NIDL Controls - review data sources and slice/filter them (NIDL = Nodes, Instances, Dimensions, Labels)
Aggregation across time
Aggregation across metrics
Info ribbon
Dice the data
Page:
36
A Netdata Chart - anomaly rate per node
Instances per Node contributing to this chart
Unique time-series per Node contributing to this chart
The visible volume each Node is contributing to this chart
The anomaly rate each Node contributes to this chart
Clicked on Nodes
The minimum, average and maximum values across all metrics this Node contributes
Similar analysis is available per Instance (“application” in this chart), dimensions, and labels.
Filter Nodes contributing data to this chart
Page:
37
Dicing any chart, without queries
Result: dimension,device_type
Page:
38
Info Ribbon: Missing data collections
A missed data collection is a gap, because something is wrong!
Netdata does not smooth out the data.
Page:
The Netdata query engine does all the calculations for all drop-down
menus and ribbons in one go, and returns everything in a
single query response.
All queries include all the information needed:
-Per Node
-Per Instance (disk, container, database, etc)
-Per Dimension
-Per Label Key
-Per Label Value
Providing:
-Availability of the samples (gaps), over time
-Min, Average and Maximum values
-Anomaly Rate for the visible timeframe
-Volume contributing to the chart
-Number of Nodes, Instances, Dimensions, Label Keys, Label Values
matched
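For reference, the agent's data can also be queried directly over its HTTP API. The snippet below uses the long-standing v1 data endpoint as a simple illustration (newer API versions are the ones that return the per-node, per-instance and per-label summaries described here):

```python
import json
import urllib.request

# Query the last 60 seconds of the system.cpu chart from a local Netdata agent.
url = "http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&format=json"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# The JSON payload contains the dimension labels and one row of values per second.
print(json.dumps(data, indent=2)[:500])  # print the beginning of the response
```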
All this
additional
information is
available on
every query,
every chart,
every metric!
39
Mission accomplished!
Netdata
Challenge 3:
make machine learning and anomaly
detectionuseful for observability
Page:
Wednesday, 2 October, 2019
Todd Underwood, Google
The vast majority of proposed production engineering uses of
Machine Learning (ML) will never work. They are structurally
unsuited to their intended purposes. There are many key
problem domains where SREs want to apply ML but most of
them do not have the right characteristics to be feasible in the
way that we hope. After addressing the most common proposed
uses of ML for production engineering and explaining why they
won't work, several options will be considered, including
approaches to evaluating proposed applications of ML for
feasibility. ML cannot solve most of the problems most
people want it to, but it can solve some problems. Probably.
Google:
All of Our ML
Ideas Are Bad
(and We Should Feel Bad)
41
AI for observability is tricky
:URL
Page:
●Trains an ML model per metric,
every 3 hours, using the last 6
hours of data of each metric.
●Maintains 18 ML models per metric,
covering the last few days.
●Detects anomalies in real-time,
while data are being collected every second.
●All available ML models for a metric need to agree that
a collected sample is an outlier, for Netdata to consider
it an Anomaly.
●Stores Anomaly Rate together with collected data.
●Calculates Host-level Anomaly Score.
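A minimal sketch of the consensus rule described above (illustrative only; the real detection runs inside the agent, per metric, on every collected sample):

```python
from typing import Callable, Sequence

# A collected sample is flagged as an anomaly only if ALL currently available
# ML models for that metric consider it an outlier (up to 18 models per metric,
# each trained on a different historical window, per the slide).
def is_anomalous(sample: float, models: Sequence[Callable[[float], bool]]) -> bool:
    return len(models) > 0 and all(model(sample) for model in models)

# Hypothetical usage: toy stand-ins for trained outlier detectors.
models = [lambda x, hi=hi: x > hi for hi in (90.0, 85.0, 92.0)]
print(is_anomalous(95.0, models))  # True  - every model agrees it is an outlier
print(is_anomalous(88.0, models))  # False - at least one model disagrees
```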
Default
ML Configuration
In Netdata
42
What does Netdata do with Machine Learning?
Page:
●A scoring engine, a unique feature across
all monitoring systems.
●All metrics, independently of their context,
can be scored across time, based on various
parameters, including their anomaly rate.
●Metrics correlations is a subset of the
scoring engine, which can score metrics based
on their rate of change (data), anomaly rate
of change (anomaly rate), but also based on
volume, similarity, and more.
Netdata can score
all metrics based
on their anomaly
rate for any given
time-frame!
43
Netdata’s scoring engine
Page:
44
A Netdata dashboard
One fully automated
dashboard, with infinite
scrolling, presenting and
grouping all metrics available.
Quick access to all sections
using the index on the right.
Multi-dimensional data on
every chart, using chart
controls to slice and dice any
dataset.
AI assisting on every step.
Page:
45
A Netdata Dashboard - what is anomalous?
Time-frame picker
Anomaly rate
per section for
the time-frame
Anomaly Rate button
Page:
●Uses Host Anomaly Rate to identify
durations of interest.
●Host Anomaly Rate is the percentage of
a host's metrics that were found to be
anomalous concurrently.
●So, a 10% host anomaly rate means that
10% of all the metrics the host exposes
were anomalous at the same time, showing
the spread of an anomaly.
Anomaly advisor
assists in finding
the needle in the
haystack.
46
Anomaly Advisor
Page:
47
Anomaly Advisor - starting point
Percentage of
Host Anomaly
Rate
Number of metrics
concurrently
anomalous
Page:
48
Anomaly Advisor - triggering the analysis
Highlighting an area on the chart triggers the analysis
Page:
49
Anomaly Advisor - the analysis
The Anomaly advisor presents a sorted list of all metrics, ordered by their anomaly rate during the highlighted time-frame.
Page:
Netdata turns AI into a consultant that can help you spot what is
interesting, what is related, what needs your attention.
●Unsupervised
There are plenty of settings, but it just works behind the scenes,
learning how metrics behave and providing an anomaly score for
them.
●It is just another attribute for each of your metrics
Anomaly Rate is stored in the metrics database together with every
sample collected, making it possible to query the past for anomalies.
●Can detect the spread of an anomaly across systems
and applications.
●Can assist in finding the aha! moment while
troubleshooting.
Unsupervised
Anomaly
Detection is an
advisor!
50
Mission accomplished!
Netdata
Challenge 4:
Make logs exploration and analytics
easy and affordable.
Page:
●Is available everywhere!
We use it already, even when we don’t realize it.
●Is secure by design!
○FSS (Forward Secure Sealing), to seal the logs
○Survives disk failures (uses tmpfs)
○Its file format is designed for minimal data loss on disk
corruption
●Is unique!
○Supports any number of fields, even per log entry
(think huge cardinality)
○Indexes all fields provided
○Queries on any combination of fields
○Maintenance free -just works!
●Amazing ingestion performance!
●Can build logs centralization points
It provides all the tools and processes to centralize all the logs of an
infra to a central place.
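To get a feel for the field-based querying described above, plain journalctl (independent of Netdata) can already filter on any indexed field and emit structured output; the unit and host names below are just examples:

```
# errors of one unit over the last hour, as structured JSON
journalctl -u nginx.service -p err --since "1 hour ago" -o json-pretty

# filter directly on indexed fields - any combination of fields works
journalctl _SYSTEMD_UNIT=nginx.service _HOSTNAME=web1 --since today
```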
systemd-journald is a hidden gem that already lives in our systems!
52
Systemd-journald
Page:
53
Netdata systemd-journal Logs Explorer
Page:
●Yes and No.
The query performance issues are simple implementation glitches,
easy to fix.
●We submitted patches to systemd
We analyzed journalctl and found several issues that, once fixed,
improve query performance 14x.
We submitted these patches to systemd.
●Netdata systemd-journal Explorer
We managed to bypass all the performance issues systemd-journal
has, independently of the version of systemd installed on a system.
Netdata is fast when querying systemd-journal logs on all systems,
even with a slow systemd-journal and journalctl.
systemd-journal
is not slow when
used with
Netdata
54
Systemd-journald: it is slow to query
Page:
●Yes it did.
Generally, very few tools are available to push structured logs to
systemd-journals.
●Netdata log2journal
We released log2journal, a powerful command line tool to
convert any kind of log into structured systemd-journal entries.
Think of it as the equivalent of promtail.
For json and logfmt formatted logs, almost zero configuration is
needed.
●Netdata systemd-cat-native
We released systemd-cat-native, a tool similar to the
standard systemd-cat, which however allows sending a stream
of entries formatted in the systemd-journal native format to a local
or remote systemd-journald.
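A hypothetical pipeline combining the two tools (the exact flags and modes are assumptions based on the descriptions above; verify against each tool's documentation): tail a JSON access log, convert each line into structured journal fields with log2journal, then hand the native-format stream to systemd-cat-native for delivery to journald.

```
# assumed invocation - illustration only
tail -F /var/log/nginx/access.log | log2journal json | systemd-cat-native
```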
The value of a
logging system
depends on its
integrations
55
Systemd-journald: it lacks integrations
:URL for log2journal
Page:
Netdata provides all the tools and dashboards to explore and
analyse your system and application logs, without actually
requiring a dedicated logs database server.
Despite the storage requirements of systemd-journald, the tool
is amazing, especially for developers, since it provides great
flexibility and troubleshooting features.
Even if you don’t want to push your traefik, haproxy or nginx
access logs to it due to its storage requirements, we strongly
recommend using it for application error logs and
exceptions. Your troubleshooting efforts can become a lot
simpler with this environment.
Netdata provides the easiest and most efficient way to access your logs, by utilizing resources and tools you already use today.
56
Mission accomplished!
Netdata
Challenge 5:
Observability is more than metrics, logs
and traces. What is missing?
Page:
To completely understand or effectively troubleshoot an issue,
metrics, logs and traces may not be enough.
What if we need to examine:
●the slow queries on a database,
●the list of network connections an application has,
●the files in a filesystem,
●… and the plethora of non-metric, non-log, non-tracing
information available?
Most monitoring systems give up. You have to use the console of
your database server, ssh to the server, or (for others :) restart
the problematic component or application and hope the issue
goes away…
Can a monitoring system help?
To completely
understand, or
effectively
troubleshoot an
issue, we need
more!
58
Challenge
Page:
59
Netdata Functions
(Diagram: the hybrid-cloud topology from before - servers A1-A5, B1-B5, C1-C5 behind Netdata Parents PA, PB, PC and a Netdata Grandparent (GP); a user accesses a function exposed by a data collection plugin on node B5, and the request is routed through the parents to that plugin.)
Page:
60
Example: Network Connections Explorer
Page:
●Data collection plugins expose Functions.
Functions have a name, some parameters, accept a payload, return a
payload and require some permissions to access them.
All these can be custom for each and every function.
●Parents are aware of their children's Functions.
Parents are updated in real-time about changes to Functions, so that all
nodes involved in a streaming and replication chain are always up to date
for the available functions of the entire infra behind them.
●Dashboards provide the list of Functions.
●Netdata UI supports widgets for Functions.
We are standardizing a set of UI widgets capable of presenting different
kinds of data, depending on which is the most appropriate way for them
to be presented.
Functions are data collection plugin features to query non-metric data of any kind
61
Mission accomplished!
Netdata
Monetization Strategy
Page:
●Horizontal Scalability
NC provides unified dashboards and alerts, and dispatches alerts centrally, without the
need to centralize all data on one server. Behind the scenes it queries multiple Netdata agents
and aggregates their responses on the fly.
●Role Based Access Control (RBAC)
NC allows grouping infrastructure and users in “war rooms”, limiting and controlling users’
access to the infrastructure.
NC also acts as a Single-Sign-On provider for all your Netdata, limiting what users can see
even when they access Netdata directly.
●Access from anywhere
NC allows accessing your Netdata servers from anywhere, without the need for a VPN.
●Mobile App for Notifications
NC enables the use of the Netdata Mobile App (iOS, Android) for receiving alert
notifications.
●Persisted Customizations and Dynamic Configuration
NC enables dynamic configuration and stores user settings, custom dashboards,
personalized views and related settings and options, per node, user, room, and space.
Netdata Cloud (NC)
complements
Netdata
63
Monetization through SaaS
IMPORTANT
Netdata Cloud does not centralize your data.
Your data are always, and exclusively, on-prem, inside the Netdata agents you install.
Netdata Cloud queries your Netdata in real-time, to present dashboards and alerts.
Thank You!
Costa Tsaousis
GitHub: https://github.com/netdata/netdata
Page:
●High fidelity monitoring
All metrics are collected per second.
Thousands versus hundreds of metrics.
Hundreds versus dozens of alerts.
●All metrics are visualized
Fully automated infrastructure level dashboards, visualizing all
metrics collected.
●Powerful visualization
Slice and dice any dataset, using controls available on all charts. No
need to learn a query language.
●Unsupervised anomaly detection
Detects anomalies by learning the behavior of each metric.
●Out of the box alerts
Predefined alerts use rolling windows and statistical functions to
detect common issues, without fixed thresholds.
From Zero
To Hero
Today!
65
The Netdata way
Page:
66
Why Netdata?
We get lost in a sea of challenges related to monitoring…
…instead of improving our infrastructure!