AI-Driven DevOps: How LLMs
and Agents Are Changing
Software Delivery
Kedar Kulkarni
All Things Open 2025
About Me
1 Senior DevOps architect at Apple; previously VMware and Red Hat
2 Co-founder of Ansible Tower Config as Code, with 1M+ downloads
3 Talk to me about: Ansible, Virtualization, Kubernetes, Containers, CI/CD, motorcycles
4 Motorcycle enthusiast: Ninja650 → ZX-6R, COTA track day experience
Disclaimer: Not representing Apple; views are my own
Today's Agenda
DevOps & Software Delivery Context
Decoding AI, LLMs & Agents
DevOps Pain Points & Human Cost
AI as the Great Equalizer
Ethics & Limitations in AI DevOps
Practical AI Solutions & Demos
Open Source Tools & Getting Started
Software Delivery Reality
Traditional View:
Code → Build → Test → Deploy → Monitor
Reality:
Complex ecosystem requiring
expertise across:
Multi-cloud infrastructure & hybrid environments
Microservices dependencies & service-to-service communication
Security & compliance requirements
Incident response & capacity planning
Reframing "Toil"
Not Just Busy Work, But Cognitive Load
Pattern Recognition
Scanning logs for errors
Identifying resource trends
Correlating across systems
Context Switching
Finding runbook locations
Recalling procedures
Translating between tools
Knowledge Access
Finding the right expert
Searching Slack history
Deciphering legacy docs
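As a rough illustration of the "scanning logs for errors" kind of toil, here is a minimal sketch that groups error lines by a normalized signature. The log format, service names, and sample lines are assumptions made up for this example, not taken from the talk.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "<timestamp> <LEVEL> <service>: <message>"
LINE_RE = re.compile(
    r"^\S+\s+(?P<level>[A-Z]+)\s+(?P<service>[\w-]+):\s+(?P<msg>.*)$"
)

def error_signatures(lines):
    """Count ERROR lines per (service, normalized message)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("level") != "ERROR":
            continue
        # Replace digit runs so "timeout after 30s" and "timeout after 45s"
        # fall into the same bucket.
        normalized = re.sub(r"\d+", "N", m.group("msg"))
        counts[(m.group("service"), normalized)] += 1
    return counts

# Made-up sample lines for illustration only.
sample = [
    "2025-10-13T10:00:01Z INFO api: request ok",
    "2025-10-13T10:00:02Z ERROR api: timeout after 30s",
    "2025-10-13T10:00:05Z ERROR api: timeout after 45s",
    "2025-10-13T10:00:07Z ERROR db: connection refused",
]

top = error_signatures(sample).most_common(1)
print(top)
```

This is the mechanical part of pattern recognition; the slide's point is that an LLM can take over the less mechanical part, such as explaining what the grouped errors mean.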
The Human Cost of DevOps Pain
"Toil" Varies By:
Experience: 5 min vs 2 hrs
Domain: Network vs App
Context: Routine vs Emergency
Access: Permissions & Knowledge
Pager fatigue & burnout
15+ scattered tools
Outside-expertise P0s
Brittle automation
Decoding AI for DevOps Teams
LLMs
Advanced pattern matching for text
Like a senior engineer who's read everything
In DevOps Terms:
Smart search + content generation
AI Agents
LLMs + tools + decision making
Reliable teammate that can execute tasks
In DevOps Terms:
Smart search + execution capabilities
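The "LLMs + tools + decision making" description can be sketched as a small loop. This is only an illustration: `stub_llm` stands in for a real model call, and the two tools and the pod name are hypothetical placeholders that return canned strings instead of querying a cluster.

```python
from typing import Callable, Dict, List

# Hypothetical tools. Real ones might wrap kubectl or a metrics API.
def get_pod_status(name: str) -> str:
    return f"pod {name}: CrashLoopBackOff"

def get_recent_logs(name: str) -> str:
    return f"logs for {name}: OOMKilled"

TOOLS: Dict[str, Callable[[str], str]] = {
    "get_pod_status": get_pod_status,
    "get_recent_logs": get_recent_logs,
}

def stub_llm(observations: List[str]) -> dict:
    """Stand-in for a model call: picks the next action from what was seen so far."""
    if not observations:
        return {"action": "get_pod_status"}
    if len(observations) == 1:
        return {"action": "get_recent_logs"}
    return {"action": "finish", "summary": " | ".join(observations)}

def run_agent(target: str, max_steps: int = 5) -> str:
    """Decide, call a tool, record the observation, repeat until done."""
    observations: List[str] = []
    for _ in range(max_steps):
        decision = stub_llm(observations)
        if decision["action"] == "finish":
            return decision["summary"]
        tool = TOOLS[decision["action"]]
        observations.append(tool(target))
    # The step limit keeps a confused agent from looping forever.
    return "step limit reached"

result = run_agent("checkout-7d9f")
print(result)
```

The difference between the two halves of the slide shows up in `TOOLS`: an LLM alone produces text, while an agent is allowed to pick and run something from that table.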
AI as the Great Equalizer
Junior Engineers
Instant institutional knowledge
Log explanation in plain language
Guided troubleshooting steps
Senior Engineers
Handles pattern recognition
Synthesizes information
Acts as thinking partner
Non-Native English Speakers
Translates technical jargon
Assists with documentation
Different Learning Styles
Visual: Diagrams & flowcharts
Sequential: Step-by-step guides
Diverse Perspectives
Startup
5-person team, multiple hats
AI helps with knowledge gaps
Focus: Quick wins, rapid iteration
Enterprise
Specialized teams, complex
approvals
AI helps with coordination
Focus: Compliance, security,
standards
Mid-Size
Growing pains, some process, some
chaos
AI helps with scaling operations
Focus: Building sustainable practices
Different contexts require different AI applications
Quick poll: Please indicate which category best describes your team: Startup, Mid-Size, or Enterprise?
Demo 1: AI-Assisted Troubleshooting
(Kubernetes Scenario)
Scenario: Production Kubernetes cluster shows high pod restart rates for a critical
service.
Junior View
Explain the pod restart issue step-by-step for a junior engineer,
including common causes and initial troubleshooting commands.
Senior View
Analyze recent deployment changes and service logs to identify
patterns contributing to the high pod restart rates across similar
incidents.
Security View
Scan for any unusual network activity, failed authentication attempts,
or privilege escalations that could indicate malicious activity related to
the pod restarts.
Business View
Assess the current customer impact of the service degradation and
provide an estimated time to resolution (ETA) based on available data
and troubleshooting progress.
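One way to read the four views above is as prompt templates applied to the same scenario. The sketch below assumes that reading; the `build_prompt` helper and the "Scenario/Task" layout are illustrative choices, not the format used in the demo.

```python
SCENARIO = (
    "Production Kubernetes cluster shows high pod restart rates "
    "for a critical service."
)

# Prompt text is taken from the slide; the dictionary keys are assumptions.
VIEW_PROMPTS = {
    "junior": (
        "Explain the pod restart issue step-by-step for a junior engineer, "
        "including common causes and initial troubleshooting commands."
    ),
    "senior": (
        "Analyze recent deployment changes and service logs to identify "
        "patterns contributing to the high pod restart rates across similar incidents."
    ),
    "security": (
        "Scan for any unusual network activity, failed authentication attempts, "
        "or privilege escalations that could indicate malicious activity related "
        "to the pod restarts."
    ),
    "business": (
        "Assess the current customer impact of the service degradation and "
        "provide an estimated time to resolution (ETA) based on available data "
        "and troubleshooting progress."
    ),
}

def build_prompt(view: str, scenario: str = SCENARIO) -> str:
    """Combine the shared scenario with the instruction for one audience."""
    if view not in VIEW_PROMPTS:
        raise ValueError(f"unknown view: {view!r}")
    return f"Scenario: {scenario}\n\nTask: {VIEW_PROMPTS[view]}"

print(build_prompt("junior"))
```

Keeping the scenario fixed and varying only the task makes it easy to compare how the same incident is explained to different audiences.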
Video (YouTube, 04:23): K8sGPT: AI-Driven DevOps Troubleshoot…
Demo 2: Self-healing K8s
System
AI agents autonomously discover, diagnose, and
remediate errors to ensure continuous system
stability.
Intelligent Error Detection
AI continuously scans logs, metrics, and traces to identify subtle
anomalies and errors before they can escalate into major incidents.
Automated Root Cause Analysis (RCA)
The AI correlates data across various sources (metrics, logs,
deployment history) to accurately pinpoint the exact cause of
operational issues.
Self-Executing Remediation
AI agents automatically apply known remediation patterns, such as
restarting failing services, scaling resources, or rolling back recent
deployments.
Improved MTTR
AI agents dramatically reduce Mean Time To Recovery (MTTR) by
eliminating human response delays and instantly applying fixes,
cutting incident resolution time from hours down to minutes.
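The "known remediation patterns" idea can be sketched as a lookup from symptom to action, with an approval gate on the riskier ones. Everything here is an assumption for illustration: the symptom names, the mapping, and the choice to gate rollbacks are not taken from the demo, and the function only returns a plan string rather than touching a cluster.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    workload: str
    symptom: str  # e.g. "CrashLoopBackOff", "HighCPU", "BadDeploy"

# Hypothetical mapping: symptom -> (action, needs_human_approval)
REMEDIATIONS = {
    "CrashLoopBackOff": ("restart", False),
    "HighCPU": ("scale_up", False),
    "BadDeploy": ("rollback", True),
}

def plan(finding: Finding, approved: bool = False) -> str:
    """Return the planned step for a finding; nothing is executed here."""
    action, needs_approval = REMEDIATIONS.get(finding.symptom, ("escalate", False))
    if action == "escalate":
        # Unknown symptoms go to a human instead of guessing at a fix.
        return f"ESCALATE: no known remediation for {finding.symptom} on {finding.workload}"
    if needs_approval and not approved:
        return f"PENDING APPROVAL: {action} {finding.workload}"
    return f"APPLY: {action} {finding.workload}"

print(plan(Finding("checkout", "CrashLoopBackOff")))
print(plan(Finding("checkout", "BadDeploy")))
print(plan(Finding("checkout", "DiskPressure")))
```

Separating planning from execution keeps a dry-run mode cheap and gives human oversight a natural place to sit.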
Video (YouTube, 01:51): How to Build a Self-Healing Kubernetes …
AI Ethics & Limitations in
DevOps
Key Considerations
Bias in automation
decisions
Transparency of
recommendations
Human oversight for critical
tasks
Guardrails
Operational
Knowledge Gap
AI training is rich in code,
poor in ops wisdom
DevOps knowledge lives in:
Private Slack threads
War room discussions
Tribal knowledge
AI Knowledge + Human Experience = Infinite Possibility
Open Source AI Tools for DevOps
Infrastructure
tfgpt
k8sgpt
kubectl-ai
AWS Copilot CLI
Development
aider
cline
roo code
Monitoring
keep
Models
Llama
Mistral
DeepSeek
Getting Started:
Implementation Roadmap
1 Week 1: Assessment & Quick Wins
Audit top 3 time-consuming tasks
Try existing tools: GitHub Copilot, ChatGPT
Goal: Save 30 min/person in first week
2 Month 1: Pilot Implementation
Choose: troubleshooting OR documentation
Build simple integration (Slack bot, dashboard)
Goal: 20% faster incidents OR 50% more docs
3 Quarter 1: Scale & Integrate
Expand successful pilots
Build internal knowledge base for AI
Goal: 80% team adoption of AI tools
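A goal like "20% faster incidents" needs a baseline to compare against. A minimal sketch of that arithmetic follows, assuming incident resolution times are tracked in minutes; the sample numbers are invented for illustration.

```python
from statistics import mean

def percent_faster(before_minutes, after_minutes):
    """Percent reduction in mean resolution time, rounded to one decimal."""
    before = mean(before_minutes)
    after = mean(after_minutes)
    return round((before - after) / before * 100, 1)

# Invented sample data: resolution times before and during a pilot.
baseline = [60, 90, 120]   # mean 90 minutes
pilot = [50, 70, 96]       # mean 72 minutes

improvement = percent_faster(baseline, pilot)
print(f"{improvement}% faster")  # prints "20.0% faster"
```

With only a handful of incidents the mean is noisy, so a pilot would likely want more data points or a median before declaring the goal met.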
Different Starting Points
High-Traffic / Large Teams
Focus: Self-service automation and intelligent routing
Specialized / Expert Teams
Focus: Knowledge capture and junior engineer onboarding
Distributed / Remote Teams
Focus: Communication enhancement and async
collaboration
Budget-Conscious Teams
Focus: Free/open-source tools and gradual integration
Risk Mitigation:
Always maintain human oversight • Start with non-production • Establish clear guidelines
Key Takeaways
AI amplifies human capabilities, doesn't replace judgment
Different teams benefit from different AI applications
Start small, measure impact, iterate based on your needs
Ethics and inclusivity should be built in from the beginning
Discussion Questions
What's your biggest "toil" challenge?
How might your team's diverse perspectives
benefit from AI assistance?
What concerns do you have about AI in your
workflows?
Which quick win would be most valuable for
your team?
Thank You
AI-Driven DevOps: How LLMs and Agents Are Changing
Software Delivery
Connect with me to continue the conversation
KEDARKULKARNI.in
ATO2025@KEDARKULKARNI.in
https://linkedin.com/in/kkulkar3
