Will the Future Belong to Kubernetes as an infrastructure provider for Databases?!

Alireza Kamrani · Oct 20, 2025


Kubernetes and databases, friends or foes?


On the existing challenges and concerns around running databases and stateful products inside Kubernetes:

Rethinking Databases and Stateful Workloads in Cloud-Native Environments

Kubernetes has become the de facto standard for container orchestration, largely because it helps organizations reduce infrastructure costs and simplify scalability and deployment. Many cloud providers have extended Kubernetes capabilities to support relational databases, often integrating them into hybrid or managed environments to improve cost efficiency and elasticity.

However, behind these services are specialized
engineering teams fine-tuning complex
infrastructure layers—particularly custom
operators that handle database orchestration,
replication, and recovery. For high-end
databases like Oracle, only Oracle’s own cloud
platform provides a truly stable and performant
managed service, thanks to its specialized
hardware stack. The same is true for systems
like Kafka, where performance and resilience are
extremely sensitive to latency and temporary
downtime.

The Challenge of Stateful Workloads on
Kubernetes

Running stateful workloads—especially
databases—on Kubernetes is not inherently
wrong, but it comes with deep technical
implications.

Kubernetes was originally designed for stateless
applications, where scaling and failover are
simple. In contrast, databases rely on data
consistency, low latency, and controlled failover,
which require careful orchestration.

The main challenges typically arise from:

Persistent Volumes and Storage Classes —
The performance and stability of a database
cluster depend heavily on the storage
backend (e.g., Ceph, EBS, Longhorn, or
OpenEBS).

Network Complexity — Internal/external
networking, service mesh configuration, and
pod-to-pod latency can significantly impact
query performance.

Operator Maturity — A robust and
customized Database Operator is essential
to manage failover, replication, and
consistency across nodes.

Team Expertise — Database administrators
and Kubernetes engineers must collaborate
closely; operational silos often lead to
instability in production.


Hybrid Architectures: A Practical Middle
Ground

For most organizations, the best approach today
is a hybrid architecture:

Run stateless applications inside Kubernetes, benefiting from agility and automation.

Keep mission-critical databases outside, in managed or dedicated environments that guarantee performance and reliability.

This model allows teams to mature their
Kubernetes operations gradually, while avoiding
risks associated with stateful workloads. Over
time, as Kubernetes-native storage and
operators evolve, more stateful workloads will
safely migrate inside the cluster.
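To make the hybrid pattern concrete, here is a minimal sketch (all names are illustrative, not from the original): a Service of type ExternalName gives in-cluster applications a stable internal DNS name for a database that actually lives outside the cluster.

```yaml
# Hypothetical example: stable in-cluster DNS for an external database.
apiVersion: v1
kind: Service
metadata:
  name: orders-db              # apps resolve orders-db.production.svc
  namespace: production
spec:
  type: ExternalName
  # CNAME target: an assumed externally managed PostgreSQL endpoint
  externalName: orders-db.managed.example.com
```

If the database later moves inside the cluster, this Service can be replaced by a normal selector-based Service under the same name, with no application changes.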

When Does Kubernetes Kill Database Performance?

Kubernetes can severely degrade database
performance when certain architectural and
operational pitfalls are ignored. Here's where
things usually go wrong:

Improper Storage Layer

Using slow or non-SSD storage backends (e.g.,
NFS or network volumes with high latency)
causes transaction delays and poor IOPS for
write-heavy databases.

> Always validate your StorageClass and underlying disk throughput.
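As an illustration, a possible SSD-backed StorageClass, assuming the AWS EBS CSI driver is installed (the provisioner and parameters will differ on other platforms):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-fast-ssd
provisioner: ebs.csi.aws.com             # assumes the AWS EBS CSI driver
parameters:
  type: gp3                              # SSD-class volume
  iops: "6000"                           # provisioned IOPS for write-heavy DBs
  throughput: "500"                      # MiB/s
volumeBindingMode: WaitForFirstConsumer  # provision in the pod's zone
reclaimPolicy: Retain                    # keep data if the claim is deleted
```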

Over-Aggressive Pod Scheduling

When the scheduler relocates database pods
frequently (due to node pressure or autoscaling),
cache locality is lost and persistent volume
reattachment delays occur.

> Use node affinity, anti-affinity, and proper
resource reservations.
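A sketch of the relevant pod-template fragment, with assumed label names (workload-class, app: my-database): node affinity pins database pods to dedicated nodes, and anti-affinity keeps replicas on separate hosts.

```yaml
# Pod template fragment (label names are assumptions for illustration).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: workload-class            # assumed node label
              operator: In
              values: ["database"]
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-database                 # assumed pod label
        topologyKey: kubernetes.io/hostname  # at most one replica per node
```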

Uncoordinated Scaling Events

Horizontal Pod Autoscalers (HPA) can't easily
understand transactional behavior. Blind scaling
might increase concurrency but break
performance consistency.

> Combine autoscaling with workload-aware metrics and custom logic.
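One hedged way to make scaling less blind, assuming a custom connection-count metric is exposed through a metrics adapter (an assumption, not a built-in): an autoscaling/v2 HPA that scales only read replicas and uses a long scale-down stabilization window.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: read-replicas-hpa            # illustrative: scale read replicas only
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-database-read           # assumed read-replica StatefulSet
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_connections   # assumed custom metric via an adapter
        target:
          type: AverageValue
          averageValue: "200"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # avoid flapping replicas
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300           # remove at most 1 pod per 5 minutes
```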

Unoptimized Network Path

Service mesh or ingress routing adds overhead. Databases sensitive to latency (like Oracle with its standby, or PostgreSQL under heavy load) may experience jitter.

> Keep DB pods close to clients and reduce
unnecessary proxy layers.
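For example, a headless Service (a standard Kubernetes construct; names here are illustrative) lets clients resolve database pod IPs directly via DNS, skipping the kube-proxy virtual-IP hop and any unnecessary proxy layer:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-database-headless   # illustrative name
spec:
  clusterIP: None              # headless: DNS returns pod IPs directly
  selector:
    app: my-database           # assumed pod label
  ports:
    - name: pg
      port: 5432
```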

Improper Resource Requests and Limits

CPU throttling and memory eviction due to misconfigured limits can silently destroy database performance under load.

> Tune requests based on steady-state
benchmarks, not theoretical peaks.
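A container fragment sketching that advice (the numbers are placeholders to be replaced with your benchmark results): setting requests equal to limits puts the pod in the Guaranteed QoS class, which gives it the lowest eviction priority.

```yaml
# Container resources fragment (values are illustrative placeholders).
resources:
  requests:
    cpu: "4"          # requests == limits -> Guaranteed QoS class
    memory: 16Gi
  limits:
    cpu: "4"          # still a hard cap: size it from steady-state benchmarks
    memory: 16Gi      # lowest eviction priority under node memory pressure
```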

Lack of Observability

Without proper tracing, Prometheus metrics, or log correlation, root cause detection becomes guesswork.

> Visibility is everything—monitor I/O, latency, and network hops continuously.
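As a sketch, assuming the Prometheus Operator CRDs are installed and the database pods run a metrics exporter sidecar (both assumptions), a ServiceMonitor keeps database metrics flowing into Prometheus:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-database-metrics     # illustrative
  labels:
    release: prometheus         # must match your Prometheus's selector
spec:
  selector:
    matchLabels:
      app: my-database          # assumed Service label
  endpoints:
    - port: metrics             # assumed exporter port name
      interval: 15s
```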

Finally, Kubernetes doesn't kill performance —
poor design does.

When engineered thoughtfully, databases can run stably inside clusters, especially for medium-priority or non-critical workloads.

Troubleshooting and Proof of Performance

How DBA and SRE Teams Can Isolate “Slow
States”

One of the toughest challenges for DBA and SRE teams in Kubernetes environments is proving that a slowdown didn't originate on their side. Application teams often point to "the database" when latency spikes appear, but validating this requires a structured troubleshooting approach.

Recommended steps to isolate issues:


1. Establish Baselines

Continuously record baseline metrics: query
latency, IOPS, CPU, and memory per database
pod.

Use time-series analysis to compare “slow
periods” against normal operation.

2. Correlate Metrics Across Layers

Combine database logs, Kubernetes metrics (via kubectl top pods or Prometheus), and application traces.

If latency increases without CPU/I/O pressure
on DB pods, the issue may lie in the network or
app layer.

3. Check Storage and Volume Attach Events

Review kubectl describe pods and node events for PersistentVolume reattachments. Even brief delays can create performance cliffs.

4. Inspect Node and Network Health

Verify if kubelet restarts, CNI latency, or DNS
lookup delays occurred around the same
timestamp.

5. Validate Application Query Behavior

Use query plans and APM tracing to confirm that
slow queries weren't caused by schema changes
or unindexed joins.

6. Prove Negative Evidence Clearly

Build an evidence bundle: Grafana dashboards + database slow logs + system metrics.

This documentation helps SRE/DBA teams
demonstrate that the database layer remained
stable.
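One hedged way to keep that evidence ready, assuming the Prometheus Operator is present and noting that the exporter metric names below are assumptions that will differ per setup: recording rules that precompute the baseline series the bundle relies on.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: db-baseline-rules            # illustrative
spec:
  groups:
    - name: db-baselines
      rules:
        # p99 query latency per pod (exporter metric name is assumed)
        - record: db:query_latency_seconds:p99
          expr: histogram_quantile(0.99, sum by (pod, le) (rate(db_query_duration_seconds_bucket[5m])))
        # sustained write throughput per DB pod (cAdvisor metric)
        - record: db:fs_write_bytes:rate5m
          expr: sum by (pod) (rate(container_fs_writes_bytes_total{pod=~"my-database-.*"}[5m]))
```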

By following this methodology, teams can
defend against false blame while improving
collaboration between DBAs, SREs, and
developers.

It's not just about finding the root cause — it's
about building trust through transparency and
data correlation.

The Role of AI and Intelligent Operators

Recent advancements in Artificial Intelligence (AI) and Machine Learning (ML) are beginning to transform how Kubernetes operators are designed and optimized.

While most database operators today handle replication, failover, and backup reliably, the next generation will integrate AI-driven intelligence to automate complex decisions and minimize human intervention.


Key emerging directions include:

Predictive Scaling & Auto-Tuning — Operators that analyze workload patterns and dynamically adjust replicas or resources based on CPU, I/O, and latency trends.

Anomaly Detection — Identifying performance
degradation or abnormal database behavior and
automatically triggering recovery actions.

Intelligent Failover Decisions — Choosing the
best failover node by evaluating historical health,
latency, and resource utilization rather than
static rules.

Self-Healing Configurations — Continuously
optimizing tuning parameters based on real-time
metrics and feedback loops.

Some existing operators already provide a
strong foundation for this vision:

- Crunchy PostgreSQL Operator (by CrunchyData)
- Percona Kubernetes Operator (for MySQL and MongoDB)
- Oracle Cloud Native Database Operator
- Operator Framework / Operator SDK, for building custom AI-enabled operators.

The combination of these tools with ML
pipelines and observability platforms like
Prometheus, Grafana, and Cortex opens the door
to truly autonomous and self-optimizing
Kubernetes database management in the near
future.
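To make the operator approach concrete, here is a minimal sketch of a Crunchy PGO v5 PostgresCluster resource (all values are illustrative; the full spec is in the operator documentation). The operator reconciles this single declaration into a replicated, backed-up cluster:

```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo                        # illustrative cluster name
spec:
  postgresVersion: 16
  instances:
    - name: instance1
      replicas: 3                    # operator manages replication/failover
      dataVolumeClaimSpec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
  backups:
    pgbackrest:                      # built-in backup management
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 200Gi
```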

Best Practices and Recommendations

If you're exploring Kubernetes for database workloads, consider the following:

Start small — Begin with development or sandbox environments before touching production.

Choose your storage backend wisely — Latency, throughput, and availability vary dramatically between storage solutions.

Use StatefulSets, not Deployments — StatefulSets ensure stable network identities and persistent volumes for database pods (see the sketch after this list).

Invest in a strong operator — For databases like PostgreSQL, MongoDB, or MySQL, mature operators (such as CrunchyData or Percona) can dramatically improve resilience.

Monitor everything — Use Prometheus and Grafana for metrics, and enable audit logging for database operations.
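A minimal StatefulSet skeleton illustrating the stable-identity point from the list above (image, names, and sizes are placeholders, not a production configuration):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-database                 # stable pod names: my-database-0, -1, ...
spec:
  serviceName: my-database-headless # stable per-pod DNS via a headless Service
  replicas: 3
  selector:
    matchLabels:
      app: my-database
  template:
    metadata:
      labels:
        app: my-database
    spec:
      containers:
        - name: postgres
          image: postgres:16        # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:             # one volume per pod, reattached on reschedule
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: db-fast-ssd   # the StorageClass sketched earlier
        resources:
          requests:
            storage: 100Gi
```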

Looking Ahead:

The future does belong to Kubernetes — but not
all at once, and not for every workload.
Stateless services will continue to dominate
Kubernetes clusters, while stateful workloads
will transition more slowly as the ecosystem
matures. With advancements in storage
abstraction, distributed volume systems, and
intelligent operators, the boundary between “in-
cluster” and “external” databases will blur over
the next few years.

Until then, success depends less on tools and
more on team expertise, observability, and
iterative learning.

Start small, learn fast, and scale with confidence.

Alireza Kamrani
Database infrastructure manager