AI-Driven DevOps: How LLMs and Agents Are Changing Software Delivery

AllThingsOpen 12 views 19 slides Oct 20, 2025

About This Presentation

Presented at All Things Open 2025
Presented by Kedar Kulkarni - Apple Inc.

Title: AI-Driven DevOps: How LLMs and Agents Are Changing Software Delivery
Abstract: The rise of Large Language Models (LLMs) and AI Agents is transforming DevOps, enabling faster incident resolution, intelligent automation...


Slide Content

AI-Driven DevOps: How LLMs and Agents Are Changing Software Delivery
Kedar Kulkarni
All Things Open 2025

About Me
Senior DevOps architect at Apple; previously VMware and Red Hat
Co-founder of Ansible Tower Config as Code with 1M+ downloads
Talk to me about: Ansible, Virtualization, Kubernetes, Containers, CI/CD, motorcycles
Motorcycle enthusiast: Ninja650 → ZX-6R, COTA track day experience
Disclaimer: Not representing Apple; views are my own

Today's Agenda
DevOps & Software Delivery Context
Decoding AI, LLMs & Agents
DevOps Pain Points & Human Cost
AI as the Great Equalizer
Ethics & Limitations in AI DevOps
Practical AI Solutions & Demos
Open Source Tools & Getting Started

Software Delivery Reality
Traditional View:
Code → Build → Test → Deploy → Monitor
Reality:
Complex ecosystem requiring expertise across:
Multi-cloud infrastructure & hybrid environments
Microservices dependencies & service-to-service communication
Security & compliance requirements
Incident response & capacity planning

Reframing "Toil"
Not Just Busy Work, But Cognitive Load
Pattern Recognition
Scanning logs for errors
Identifying resource trends
Correlating across systems
Context Switching
Finding runbook locations
Recalling procedures
Translating between tools
Knowledge Access
Finding the right expert
Searching Slack history
Deciphering legacy docs

The Human Cost of DevOps Pain
"Toil" Varies By:
Experience: 5 min vs 2 hrs
Domain: Network vs App
Context: Routine vs Emergency
Access: Permissions & Knowledge
Pager fatigue & burnout
15+ scattered tools
Outside-expertise P0s
Brittle automation

Decoding AI for DevOps Teams
LLMs
Advanced pattern matching for text
Like a senior engineer who's read everything
In DevOps Terms:
Smart search + content generation
AI Agents
LLMs + tools + decision making
Reliable teammate that can execute tasks
In DevOps Terms:
Smart search + execution capabilities

AI as the Great Equalizer
Junior Engineers
Instant institutional knowledge
Log explanation in plain language
Guided troubleshooting steps
Senior Engineers
Handles pattern recognition
Synthesizes information
Acts as thinking partner
Non-Native English Speakers
Translates technical jargon
Assists with documentation
Different Learning Styles
Visual: Diagrams & flowcharts
Sequential: Step-by-step guides

Diverse Perspectives
Startup
5-person team, multiple hats
AI helps with knowledge gaps
Focus: Quick wins, rapid iteration
Enterprise
Specialized teams, complex approvals
AI helps with coordination
Focus: Compliance, security, standards
Mid-Size
Growing pains, some process, some chaos
AI helps with scaling operations
Focus: Building sustainable practices
Different contexts require different AI applications
Quick poll: Please indicate which category best describes your team: Startup, Mid-Size, or Enterprise?

Demo 1: AI-Assisted Troubleshooting (Kubernetes Scenario)
Scenario: Production Kubernetes cluster shows high pod restart rates for a critical service.
Junior View
Explain the pod restart issue step-by-step for a junior engineer, including common causes and initial troubleshooting commands.
Senior View
Analyze recent deployment changes and service logs to identify patterns contributing to the high pod restart rates across similar incidents.
Security View
Scan for any unusual network activity, failed authentication attempts, or privilege escalations that could indicate malicious activity related to the pod restarts.
Business View
Assess the current customer impact of the service degradation and provide an estimated time to resolution (ETA) based on available data and troubleshooting progress.
Video (YouTube, 04:23): K8sGPT: AI-Driven DevOps Troubleshoot...
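
One way to picture the four views in Demo 1 is as a small prompt table keyed by audience. The prompt text below is copied from the slide; the function name `build_prompt` and the "Scenario / Task" layout are assumptions for illustration, and this is not a claim about how the demo itself assembles its prompts.

```python
SCENARIO = (
    "Production Kubernetes cluster shows high pod restart rates "
    "for a critical service."
)

# Prompt text taken from the four views on the slide.
VIEW_PROMPTS = {
    "junior": (
        "Explain the pod restart issue step-by-step for a junior engineer, "
        "including common causes and initial troubleshooting commands."
    ),
    "senior": (
        "Analyze recent deployment changes and service logs to identify "
        "patterns contributing to the high pod restart rates across "
        "similar incidents."
    ),
    "security": (
        "Scan for any unusual network activity, failed authentication "
        "attempts, or privilege escalations that could indicate malicious "
        "activity related to the pod restarts."
    ),
    "business": (
        "Assess the current customer impact of the service degradation and "
        "provide an estimated time to resolution (ETA) based on available "
        "data and troubleshooting progress."
    ),
}


def build_prompt(view: str, scenario: str = SCENARIO) -> str:
    """Combine the shared scenario with the view-specific task (hypothetical layout)."""
    if view not in VIEW_PROMPTS:
        raise ValueError(f"unknown view: {view!r}")
    return f"Scenario: {scenario}\n\nTask: {VIEW_PROMPTS[view]}"


print(build_prompt("junior"))
```

Keeping the scenario fixed and varying only the task makes it easy to compare how the same incident is explained to different audiences.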

Demo 2: Self-healing K8s System
AI agents autonomously discover, diagnose, and remediate errors to ensure continuous system stability.
Intelligent Error Detection
AI continuously scans logs, metrics, and traces to identify subtle anomalies and errors before they can escalate into major incidents.
Automated Root Cause Analysis (RCA)
The AI correlates data across various sources (metrics, logs, deployment history) to accurately pinpoint the exact cause of operational issues.
Self-Executing Remediation
AI agents automatically apply known remediation patterns, such as restarting failing services, scaling resources, or rolling back recent deployments.
Improved MTTR
AI agents dramatically reduce Mean Time To Recovery (MTTR) by eliminating human response delays and instantly applying fixes, cutting incident resolution time from hours down to minutes.
Video (YouTube, 01:51): How to Build a Self-Healing Kubernetes ...
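
The detect, diagnose, remediate flow described in Demo 2 can be sketched with fixed rules in place of an AI model. The thresholds, the `PodSignal` fields, and the replica count are all assumptions chosen for illustration; the sketch only builds kubectl command strings for the three remediation patterns named on the slide (restart, scale, rollback) and never executes them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PodSignal:
    """Hypothetical summary of one deployment's recent health signals."""
    deployment: str
    restarts_last_hour: int
    minutes_since_deploy: int
    cpu_utilization: float  # 0.0 to 1.0


# Assumed thresholds, not taken from the talk.
RESTART_THRESHOLD = 5
CPU_THRESHOLD = 0.9
RECENT_DEPLOY_MINUTES = 30


def diagnose(signal: PodSignal) -> str:
    """Rule-based stand-in for the root cause analysis step."""
    restarting = signal.restarts_last_hour >= RESTART_THRESHOLD
    hot = signal.cpu_utilization >= CPU_THRESHOLD
    if not restarting and not hot:
        return "healthy"
    if restarting and signal.minutes_since_deploy <= RECENT_DEPLOY_MINUTES:
        return "bad_deploy"
    if hot:
        return "resource_pressure"
    return "crash_loop"


def plan_remediation(signal: PodSignal) -> Tuple[str, List[str]]:
    """Map the diagnosis to a command; commands are returned, not run."""
    cause = diagnose(signal)
    name = signal.deployment
    commands = {
        "healthy": [],
        "bad_deploy": [f"kubectl rollout undo deployment/{name}"],
        # The replica count of 4 is an arbitrary illustrative value.
        "resource_pressure": [f"kubectl scale deployment/{name} --replicas=4"],
        "crash_loop": [f"kubectl rollout restart deployment/{name}"],
    }
    return cause, commands[cause]


print(plan_remediation(PodSignal("checkout", 12, 10, 0.4)))
```

Returning the plan instead of running it leaves room for a review or approval step before anything touches the cluster.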

AI Ethics & Limitations in DevOps
Key Considerations
Bias in automation decisions
Transparency of recommendations
Human oversight for critical tasks
Guardrails
Operational Knowledge Gap
AI training is rich in code, poor in ops wisdom
DevOps knowledge lives in:
Private Slack threads
War room discussions
Tribal knowledge

AI Knowledge + Human Experience = Infinite Possibility

Open Source AI Tools for DevOps
Infrastructure
tfgpt
k8sgpt
kubectl-ai
AWS Copilot CLI
Development
aider
cline
roo code
Monitoring
keep
Models
Llama
Mistral
DeepSeek

Getting Started: Implementation Roadmap
1. Week 1: Assessment & Quick Wins
Audit top 3 time-consuming tasks
Try existing tools: GitHub Copilot, ChatGPT
Goal: Save 30 min/person in first week
2. Month 1: Pilot Implementation
Choose: troubleshooting OR documentation
Build simple integration (Slack bot, dashboard)
Goal: 20% faster incidents OR 50% more docs
3. Quarter 1: Scale & Integrate
Expand successful pilots
Build internal knowledge base for AI
Goal: 80% team adoption of AI tools
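
A goal such as "20% faster incidents" needs a baseline to compare against. The sketch below shows one possible way to compute the change in mean resolution time; the function name is hypothetical and the incident durations are made-up sample numbers, not results from the talk.

```python
from statistics import mean


def percent_faster(baseline_minutes, pilot_minutes):
    """Percent reduction in mean resolution time from baseline to pilot."""
    base = mean(baseline_minutes)
    pilot = mean(pilot_minutes)
    return (base - pilot) / base * 100.0


# Made-up incident durations in minutes, for illustration only.
baseline = [60, 90, 120, 90]  # mean is 90
pilot = [50, 70, 80, 70]      # mean is 67.5

improvement = percent_faster(baseline, pilot)
print(f"{improvement:.1f}% faster")   # (90 - 67.5) / 90 = 25.0%
print("goal met:", improvement >= 20)
```

With only a handful of incidents the mean is sensitive to outliers, so a team may prefer to compare medians or a longer time window.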

Different Starting Points
High-Traffic / Large Teams
Focus: Self-service automation and intelligent routing
Specialized / Expert Teams
Focus: Knowledge capture and junior engineer onboarding
Distributed / Remote Teams
Focus: Communication enhancement and async collaboration
Budget-Conscious Teams
Focus: Free/open-source tools and gradual integration
Risk Mitigation:
Always maintain human oversight • Start with non-production • Establish clear guidelines
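
The risk-mitigation points above (human oversight, starting with non-production) could be enforced with a simple approval gate in front of any automated action. The environment names, the action names, and the `is_allowed` function are assumptions for illustration, not part of any tool mentioned in the talk.

```python
from typing import Optional

# Hypothetical policy: these sets are examples, not a recommended list.
READ_ONLY_ACTIONS = {"describe", "logs", "top"}
NON_PRODUCTION = {"dev", "staging"}


def is_allowed(action: str, environment: str,
               approved_by: Optional[str] = None) -> bool:
    """Decide whether an automated action may proceed.

    Read-only actions run anywhere, changes run freely outside production,
    and changes in production need a named human approver.
    """
    if action in READ_ONLY_ACTIONS:
        return True
    if environment in NON_PRODUCTION:
        return True
    return bool(approved_by)


print(is_allowed("logs", "production"))
print(is_allowed("rollback", "staging"))
print(is_allowed("rollback", "production"))
print(is_allowed("rollback", "production", approved_by="on-call engineer"))
```

Keeping the policy in one small function makes the guideline explicit and easy to audit as the set of automated actions grows.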

Key Takeaways
AI amplifies human capabilities, doesn't replace judgment
Different teams benefit from different AI applications
Start small, measure impact, iterate based on your needs
Ethics and inclusivity should be built in from the beginning

Discussion Questions
What's your biggest "toil" challenge?
How might your team's diverse perspectives benefit from AI assistance?
What concerns do you have about AI in your workflows?
Which quick win would be most valuable for your team?

Thank You
AI-Driven DevOps: How LLMs and Agents Are Changing Software Delivery
Connect with me to continue the conversation
KEDARKULKARNI.in
https://linkedin.com/in/kkulkar3