GDG Cloud Southlake #36: Kyle Forster: AI and Modern Workflow Automation: Automating Runbooks

RunWhen
“This is the most practical
application of AI for DevOps that I
have seen so far, and it is working
for us today”
- Dir Platform Engineering, Fortune 25
Health/Retail

100x faster than traditional
automation initiatives, 1/5th the
budget of low-cost outsourcing

Team from Google (Kubernetes)

Investors include current and former
CIO/CTO/VPs of Google, Netflix,
LinkedIn, Goldman Sachs, Comcast
and Uber

Engaged with Global 2000
enterprises in financial services,
communication service provider,
and SaaS verticals

Signed partnerships with key
Systems Integrators, including 6x
GCP partner-of-the-year

Thank you,
Jim and the GDG
Southlake crew…
Alyssa Hamulak, Nitin Raut,
Mallikarjun Dontula,
Diwakar Pandrangi,
Mike Shirk, Kenny Kon,
Yujun Liang, Ramji Bala

K8s Architecture Plan - Landing Zone
10:00-10:50, LC-52a, 12 attendees accepted
Ahmad Hassan
In #test-env
Can someone restart cart svc in test? URGENT
Li Wang
In #test-env
@Ahmad - Getting 500s from cart-service …
Ahmad Hassan
In #platform
#channel - How do I restart cart pods? URGENT!
Michael Shannon (dev)
In office today? #prod-esc (again).. URGNT
1 hour ago
48 mins ago
35 mins ago
34 mins ago
alertmanager-test-alerts
In #test-environment
Incident #0.gnslwfplaa is ongoing
2 hours ago
20 mins ago
18 mins ago
You can’t
automate this
But you could have
automated that
alertmanager-test-alerts
In #test-env-alerts
Incident #0.gnslwfplaa is ongoing (54 MINUTES)
22 mins ago
Sara Foster
Any chance that cart-postgres is full again?
One of my tests just started failing. Prob DB.

Lean teams are here to stay.
Most teams working with us are 20-40% smaller
than they were four years ago. They are left with a
lot of work to automate, but nobody has the time.

When you can’t add more engineers,
build Engineering Assistants.
100x faster than traditional automation initiatives and
~1/5th the budget of low-cost outsourcing, built for the
work your team would automate if they had the time.

How many developers can your platform team
support before you need to add L0/L1 headcount?
Without Engineering Assistants (before): 25
With Engineering Assistants for devs (after): 110
A Tier-1 Telco platform team is providing
Assistants to developers to speed up a
mass migration to GKE
A Fortune 50 ops team is hiring expert
SREs with Assistants to replace $800m
of outsourced L0/L1 support
Outsourced
L0/L1
support
(before)

Expert SREs
with Assistants
(after)
First Project: $600k/yr cost
savings, 78% reduction in MTTR
and a high-end team
In Dev/Test In Production

Execs have executive assistants, engineers should have
engineering assistants… what could one do for you?
DEV
-Troubleshoot dependencies outside of my code
-Collect relevant logs and state outside of my code when my tests fail
-Sanity check my manifests
-Enrich tickets assigned to me with logs, env vars, etc.
-Bump CPU/Mem/Storage/Replicas when I need it
PLATFORM
-Collect info for devs’ (repetitive) troubleshooting
-Triage noisy alerts in the test env
-Help with basic resource right-sizing
-Sanity check manifests
-Troubleshoot and re-run flaky CI/CD jobs
PROD
-Triage test and production alerts
-Collect info and route an escalation to the right person
-Run broad health checks across many components for root cause analysis
-Collect reliability analytics
-Repetitive remediations (restarts, expand storage, right-size replicas…)
Intuition: delegate tasks that can be done on the CLI

Goal-driven automation is 100x faster
Instead of writing code to automate a workflow, engineers
sync an environment with experts’ libraries and give their
Engineering Assistant a “goal.”
The Assistant runs automated steps from the libraries to
reach the goal, escalating if it can’t take the next step.
There is so little code/configuration required that most
teams have their first Assistant running in under an hour.
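The goal-driven loop described above can be pictured roughly as follows. This is a minimal sketch, not RunWhen's implementation: the `Task` class, the `pursue_goal` function, and the task names are all hypothetical, and a real Assistant would choose its steps from the synced libraries rather than from a fixed list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """One automated step from a (hypothetical) task library."""
    name: str
    run: Callable[[Dict[str, str]], bool]  # True means the step succeeded


def pursue_goal(goal: str, tasks: List[Task], context: Dict[str, str]) -> Dict[str, object]:
    """Run steps in order; escalate at the first step that cannot proceed."""
    completed: List[str] = []
    for task in tasks:
        try:
            ok = task.run(context)
        except Exception:
            ok = False  # a crashing step is treated like a failed step
        if not ok:
            return {"goal": goal, "status": "escalated",
                    "completed": completed, "blocked_on": task.name}
        completed.append(task.name)
    return {"goal": goal, "status": "reached",
            "completed": completed, "blocked_on": None}


# Illustrative steps only; the restart step is gated by a made-up context flag.
example_tasks = [
    Task("check pod status", lambda ctx: True),
    Task("collect logs", lambda ctx: True),
    Task("restart deployment", lambda ctx: ctx.get("restart_allowed") == "yes"),
]
example_result = pursue_goal("cart service healthy in test",
                             example_tasks, {"restart_allowed": "no"})
print(example_result["status"], "-", example_result["blocked_on"])
```

The result records which steps finished and where the run stopped, so whoever receives the escalation can see what was already tried.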

K8s AI Architecture Plan
10:00-10:50, LC-52a, 12 attendees accepted
Your Engineering
Assistant did this…
…so you can do that
Edgar

Your Engineering
Assistant did this…
…your execs
get that
Edgar

Your Engineering
Assistant did this…
Edgar
$
…the author
received that*
* Applicable for public automation libraries

Execute safe restart of Kubernetes
deployment cart
Check if cart-svc memory was
above 80% in the last 5 mins
Search cart-api logs for
java error messages
Production-Pali monitors cloud infra error
budgets for ecom-staging and can
reboot out-of-memory VMs or add storage
capacity before users are impacted
Test-Tania responds to #oncall-tst
non-prod alerts and runs hourly health
checks across the entire test environment
Eager-Edgar helps developers by running
diagnostics for their kubernetes
deployments in dev-dk8s-w1
Sync a cluster/cloud with RunWhen libraries to build
your first Engineering Assistant in one hour
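Two of the example tasks on this slide ("Check if cart-svc memory was above 80% in the last 5 mins" and "Search cart-api logs for java error messages") are small enough to sketch in plain Python. This is only an illustration under assumptions: the function names, the sample format (Unix timestamp, memory percent), and the error pattern are hypothetical, and a real task would pull its data from the cluster or a metrics backend.

```python
import re
from typing import Iterable, List, Tuple


def memory_exceeded(samples: Iterable[Tuple[float, float]], now: float,
                    threshold_pct: float = 80.0, window_s: float = 300.0) -> bool:
    """True if any (timestamp, memory_percent) sample within the last
    window_s seconds is above threshold_pct."""
    return any(now - window_s <= ts <= now and pct > threshold_pct
               for ts, pct in samples)


# Assumed pattern: a dotted Java class name ending in Exception/Error,
# or an upper-case ERROR log level.
JAVA_ERROR = re.compile(r"\b(?:[\w$]+\.)+[\w$]*(?:Exception|Error)\b|\bERROR\b")


def find_java_errors(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like Java errors."""
    return [line for line in log_lines if JAVA_ERROR.search(line)]
```

Both functions are pure, so they can be checked against recorded samples and log excerpts before being pointed at a live environment.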

A roadmap to get your team the time to build high-impact
automation across environments
Saves enough time
to build the next…
…and the next…
…and the next

This team increased feature velocity by
15% while sharing L0/L1 support thanks
to unified automation across dev,
platform and ops engineers
Collaborating Across Dev, Test And Production
-Dev → Prod: Automation to
diagnose failed tests was
re-used to triage alerts in
production
-Prod → Dev: Automation to
collect status during incidents
was re-used to maintain the
shared test environment
Story points per sprint dropped as dev
team started sharing L0/L1 support…
… and came back to a new high
thanks to increased automation

Chart: Developer time spent per pull request
Legend: Writing new code (IDE), Troubleshooting (CLI)
Scenarios: My new code worked in the first few tries; I ran into issues inside of my code; I ran into issues outside of my code
Chart labels: 8x, 1.2x
* RunWhen dev team study. See also IEEE / Microsoft “Today Was A Good Day”
Engineering Assistants for developer productivity
(business case templates and benchmarks available on request)
-Dev/Test Engineering Assistants
mitigate these scenarios where
most productivity is lost
-Increase dev velocity AND reduce
escalations to platform/ops teams
-Automation started here is
re-usable for production L0/L1
support

-The developer didn’t end up
blocked for hours on an issue
outside of their code
-The developer didn’t escalate to
the platform/devops team for help
-The developer didn’t make a fuss
about docs that weren’t updated
(which they wouldn’t have found
anyways)

Engineering Assistants for test/staging environments
(business case templates and benchmarks available on request)
“We saved so much
on our non-prod
observability costs
that it paid for the
entire RunWhen
deployment…”
-VP Engineering, AI Startup
-Broad range of automated health
measurement, detailed diagnostics,
root cause and remediation for
Kubernetes apps out-of-the-box
-Executive reporting for Service
Health and Operational Readiness

Engineering Assistants for test/staging environments
(business case templates and benchmarks available on request)
Cost Savings: Saved $12k/month of non-production logging costs
by capturing logs directly from pods on alerts/tickets/requests
Velocity: Assistant copied logs, env vars and service status to a ticket
before restarting services, reducing Dev <> QA friction
Collaboration: Dev and QA teams contributed automated health
checks specific to the application that flowed through to production
for use by SREs
Test Tania Joined
This Assistant handles alerts and failed test webhooks
in the test environment, and can do basic remediation
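The "Velocity" item above (copy logs, env vars and service status to a ticket before restarting) comes down to an ordering rule: capture the evidence first, restart second. A minimal sketch of that ordering follows, with every name hypothetical; the four callables stand in for whatever cluster and ticketing integrations a team actually uses.

```python
from typing import Callable, Dict


def enrich_then_restart(service: str,
                        get_logs: Callable[[str], str],
                        get_env: Callable[[str], Dict[str, str]],
                        get_status: Callable[[str], str],
                        restart: Callable[[str], None]) -> str:
    """Capture diagnostics, then restart; return the ticket comment text."""
    logs = get_logs(service)
    env = get_env(service)
    status = get_status(service)
    comment = "\n".join([
        f"Service: {service}",
        f"Status before restart: {status}",
        "Env vars:",
        *[f"  {key}={value}" for key, value in sorted(env.items())],
        "Recent logs:",
        logs,
    ])
    restart(service)  # runs only after the evidence has been captured
    return comment
```

In practice the env var step would likely need to redact secrets before anything is written to a ticket; that is left out here to keep the sketch short.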

* RunWhen survey, n=127, 2023
What are your top “keep the lights on” tasks that your
team would automate if you had the time?
Helping devs with repetitive troubleshooting (over Slack): 60%
Triaging noisy alerts in test environments: 30%
Fixing basic errors in devs’ manifests: 10%
This Assistant listens in Slack to direct engineers to the right tasks to run in the test environment
Engineering Assistants for test/staging environments
(business case templates and benchmarks available on request)

SRE time doing initial triage using
dashboards
SRE time spent in automate-able root
cause analysis on the CLI
SRE time spent in remediation (various)
52%
SRE time spent
on prod alerts
and tickets
* Joint customer study based on 2000 production alerts/tickets, validated with 8+ interviews
Engineering Assistants for production reliability
(business case templates and benchmarks available on request)
-Collect and summarize hundreds
of health checks before an expert
is hands-on-keyboard
-Reduce leakage of production
credentials while giving broader
access to “safe” automation

Let us help build your first three
Engineering Assistants as a PoC.
We typically get a first Engineering Assistant
running in your environment in an hour so your
platform team can check it out*.
In a workshop a few weeks later, we build the next
two Assistants for your team to give to developers,
SREs, QA, etc.
* The default Assistant uses tasks that only access the Kubernetes API server, typically with ClusterView or single namespace
read-only permissions. Subsequent Assistants use a broader set of tasks, integrating with more tools in your environment.

THANK YOU
