From Gatekeeper to Kyverno: Kubernetes Policy Management with Performance by Tanat Lokejaroenlarb

ScyllaDB · 32 slides · Oct 17, 2025

About This Presentation

This talk shares our journey migrating from Gatekeeper to Kyverno for Kubernetes policy management at Adevinta. Faced with the need for resource mutation beyond Gatekeeper’s capabilities, we explored Kyverno’s out-of-the-box support for both validation and mutation. We’ll cover the challenges,...


Slide Content

A ScyllaDB Community
From Gatekeeper to Kyverno:
Kubernetes Policy Management
with Performance
Tanat Lokejaroenlarb
Staff Site Reliability Engineer

Tanat Lokejaroenlarb

Staff Site Reliability Engineer at Adevinta
■Runtime team focusing on SRE and Platform
Engineering
■I write blog posts about SRE and real-world
incidents at https://tanatloke.medium.com
■I’m from Thailand, living in Barcelona

Agenda overview

■What do we do at Adevinta and why do we need “policies”?
■What motivates the evaluation of different tools?
■Kyverno vs OPA: migration and lessons learned

What do we do @Adevinta?

SCHIP
Internal Kubernetes Platform with managed capabilities hosting workloads for
e-commerce marketplaces across Europe
■30+ Production clusters in 4 AWS regions
■2k+ nodes, 80k+ pods, 250k+ rps at peak time
■“Multi-tenant”*

Policy management is a
cornerstone of Multi-tenant
Kubernetes

Policy management is used in different aspects
■[Security] Enforce Ingress hostname convention
●Prevent duplicated ingress hosts across namespaces
■[Abstraction] Prohibit annotations that interfere with our integration
●This ensures we avoid tight coupling with implementation details, which is useful for migrations
■[Operation] Prevent system tolerations
●Isolates system workloads and user workloads for stability/security



OPA was originally the only sane option. It’s battle-tested.

OPA policy example
Ensure unique hostname inside the cluster

violation[{"msg": msg}] {
  host := input.review.object.spec.rules[_].host
  myns := input.review.object.metadata.namespace
  other := data.inventory.namespace[_][otherapiversion]["Ingress"][name]
  re_match("^(extensions|networking.k8s.io)/.+$", otherapiversion)
  not host in input.parameters.UniqHostExceptions
  other.spec.rules[_].host == host
  other.metadata.namespace != input.review.object.metadata.namespace
  msg := sprintf("ingress host '%v' in namespace %v conflicts with ingress '%v' on ns %v",
    [host, myns, name, other.metadata.namespace])
}

It works, but….

1.REGO Complexity

violation[{"msg": msg}] {
  host := input.review.object.spec.rules[_].host
  myns := input.review.object.metadata.namespace
  other := data.inventory.namespace[_][otherapiversion]["Ingress"][name]
  re_match("^(extensions|networking.k8s.io)/.+$", otherapiversion)
  not host in input.parameters.UniqHostExceptions
  other.spec.rules[_].host == host
  other.metadata.namespace != input.review.object.metadata.namespace
  msg := sprintf("ingress host '%v' in namespace %v conflicts with ingress '%v' on ns %v",
    [host, myns, name, other.metadata.namespace])
}

REGO is hard to understand and increases cognitive load

Only a few members of the team are confident with REGO

2. Limited Mutating capabilities*
As use cases grow, we rely heavily on mutating capabilities in our multi-tenant setup
■[Abstraction] Provide features with annotations
●Add nodeSelector/tolerations automatically based on annotation for nodepool selection
■[Operation] Inject hints based on specific resources types
●Prevent unintended disruptions for Cronjobs/Jobs pods
■[Operation] Provide a sane default configuration
●PodDisruptionBudgets, Resources request/limits

Gatekeeper offers Assign/AssignMetadata for mutation (assign or replace), but they are less flexible

3. Resource consumption at Scale!
Accessing the real-time state of cluster objects is common when
writing anything more than a simple validating policy.

3. Resource consumption at Scale!
Resource usage grows as more objects are involved

The alternative: Kyverno

1. YAML is the SRE/DevOps best friend
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: prevent-restricted-toleration
spec:
  validationFailureAction: Enforce
  rules:
    - name: prevent-restricted-toleration
      match:
        any:
          - resources:
              kinds:
                - Pod
              operations:
                - CREATE
                - UPDATE
      preconditions:
        all:
          - key: "{{ request.object.spec.tolerations[*].key }}"
            operator: AnyIn
            value:
              - special-nodes
      validate:
        message: "Pod is denied because it has a forbidden toleration key (schip-controller)"
        anyPattern:
          - spec:
              tolerations:
                - key: "special-nodes"
                  operator: "*"
                  value: "*"
                  effect: "*"

2. Mutating made simple
mutate:
  patchStrategicMerge:
    spec:
      template:
        spec:
          dnsConfig:
            options:
              - name: ndots
                value: "{{ request.object.metadata.annotations.\"schip/extended-ndots\" }}"

mutate:
  patchesJson6902: |-
    - path: "/spec/tolerations/-"
      op: add
      value:
        key: "alpha.gpu.node.schip.io/gpu"
        operator: "Exists"
        effect: "NoSchedule"

3. Less memory with apiCall
A similar caching style is available via GlobalContextEntry, with the
addition of namespace filtering
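A minimal sketch of what this can look like (resource and cache names here are illustrative, not from the talk): a GlobalContextEntry declares the cached resource once, and policy rules reference the cache by name instead of issuing a fresh API call on every admission request:

apiVersion: kyverno.io/v2alpha1
kind: GlobalContextEntry
metadata:
  name: ingress-cache        # hypothetical name
spec:
  kubernetesResource:
    group: networking.k8s.io
    version: v1
    resource: ingresses

A policy rule can then read from the cache in its context:

context:
  - name: ingressHosts
    globalReference:
      name: ingress-cache
      jmesPath: "[].spec.rules[].host"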

4. Supports Kubernetes-native ValidatingAdmissionPolicy/MutatingAdmissionPolicy (VAP/MAP) and CEL
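As a sketch of the native route (assuming Kubernetes 1.30+, where ValidatingAdmissionPolicy is GA), the toleration check from the earlier slide could be expressed directly in CEL, without any policy-engine webhook in the admission path:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: prevent-restricted-toleration
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
  validations:
    - expression: >-
        !has(object.spec.tolerations) ||
        !object.spec.tolerations.exists(t, t.key == 'special-nodes')
      message: "Pod is denied because it has a forbidden toleration key"

Note that a ValidatingAdmissionPolicyBinding is also required before the policy takes effect; it is omitted here for brevity.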

The migration

Make it official, build consensus (ADR)

Gradual Migration Strategy
■No new policies will be added to OPA (bankruptcy)
●All new policies for both Validating/Mutating will be done with Kyverno
■Gradually migrate existing policies (priority)
●Ease of Moving (Rule Complexity)
■High Priority: Simple or standard policies that can be quickly expressed in Kyverno.
■Lower Priority: Complex Gatekeeper rules with intricate Rego logic.
●Resource Consumption
■High Priority: Rules that track or validate a high volume of resources (e.g., all Pods, or large subsets of
namespaced objects). Migrating these rules first yields the biggest memory and performance gains.
■Lower Priority: Rules that apply to smaller subsets of resources or are rarely triggered.

Result

The lesson learned

Testing strategy

More robustness via integration tests

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: deployment-replicas-higher-than-pdb
spec:
  steps:
    - name: 01 - Create policy
      try:
        - apply:
            file: ../policy.yaml
    - name: 02 - Create existing Deployments in cluster
      try:
        - apply:
            file: existing-deployments.yaml
    - name: 03 - Create bad PDBs
      try:
        - apply:
            file: bad-pdb.yaml
          expect:
            - check:
                ($error != null): true

Start from Audit before Enforce
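In Kyverno this is a one-field change: ship the policy with validationFailureAction set to Audit, watch the resulting PolicyReports for violations, and flip it to Enforce only once the reports are clean. A minimal fragment:

spec:
  validationFailureAction: Audit    # report violations via PolicyReports without blocking requests
  # later, once PolicyReports show no unexpected violations:
  # validationFailureAction: Enforce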

Webhooks can affect your cluster*

It’s important to monitor latency and failures
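Beyond monitoring, the blast radius of a misbehaving webhook can be bounded per policy. A sketch using Kyverno's per-policy knobs (the values shown are illustrative, not the talk's settings):

spec:
  failurePolicy: Ignore        # fail open: the API server admits requests if the webhook errors or times out
  webhookTimeoutSeconds: 10    # cap the admission latency this policy can add

Fail-open trades enforcement guarantees for availability, so security-critical policies may still warrant failurePolicy: Fail.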

Summary
■Policy management is important for multi-tenant platforms
●OPA is robust but lacks mutating capability, and REGO increases cognitive load
●Kyverno is YAML-based and works well for both Validating and Mutating scenarios
■Migration strategy
●Start with team consensus
●Bankrupt and gradual migration, start from simple and high impact
■Monitor Latency and Denial for smooth operation

Thank you! Let’s connect.
Tanat Lokejaroenlarb
[email protected]
www.linkedin.com/in/tanatloke
https://tanatloke.medium.com