GDG Cloud Southlake #36: Kyle Forster: AI and Modern Workflow Automation: Automating Runbooks

JamesAnderson135 325 views 28 slides Sep 30, 2024

About This Presentation

AI and Modern Workflow Automation: Automating Runbooks

Kyle will talk about the journey of turning a design inspired by Google's runbook automation system into an enterprise product, and how a weekend with LLMs caused them to re-imagine the design from the ground up. Using examples from their ...


Slide Content

RunWhen
“This is the most practical
application of AI for DevOps that I
have seen so far, and it is working
for us today”
- Dir Platform Engineering, Fortune 25
Health/Retail


100x faster than traditional
automation initiatives, 1/5th the
budget of low-cost outsourcing


Team from Google (Kubernetes)


Investors include current and former
CIO/CTO/VPs of Google, Netflix,
LinkedIn, Goldman Sachs, Comcast
and Uber


Engaged with Global 2000
enterprises in financial services,
communication service provider,
and SaaS verticals


Signed partnerships with key
Systems Integrators, including 6x
GCP partner-of-the-year

Thank you,
Jim and the GDG
Southlake crew…
Alyssa Hamulak, Nitin Raut,
Mallikarjun Dontula,
Diwakar Pandrangi,
Mike Shirk, Kenny Kon,
Yujun Liang, Ramji Bala

K8s Architecture Plan - Landing Zone
10:00-10:50, LC-52a, 12 attendees accepted
Ahmad Hassan
In #test-env
Can someone restart cart svc in test? URGENT
Li Wang
In #test-env
@Ahmad - Getting 500s from cart-service …
Ahmad Hassan
In #platform
#channel - How do I restart cart pods? URGENT!
Michael Shannon (dev)
In office today? #prod-esc (again).. URGNT
1 hour ago
48 mins ago
35 mins ago
34 mins ago
alertmanager-test-alerts
In #test-environment
Incident #0.gnslwfplaa is ongoing
2 hours ago
20 mins ago
18 mins ago
alertmanager-test-alerts
In #test-environment
Incident #0.gnslwfplaa is ongoing
2 hours ago
alertmanager-test-alerts
In #test-environment
Incident #0.gnslwfplaa is ongoing
2 hours ago
You can’t
automate this
But you could have
automated that
alertmanager-test-alerts
In #test-env-alerts
Incident #0.gnslwfplaa is ongoing (54 MINUTES)
22 mins ago
Sara Foster
Any chance that cart-postgres is full again?
One of my tests just started failing. Prob DB.

Lean teams are here to stay.
Most teams working with us are 20-40% smaller
than they were four years ago. They are left with a
lot of work to automate, but nobody has the time.

When you can’t add more engineers,
build Engineering Assistants.
100x faster than traditional automation initiatives and
~1/5th the budget of low-cost outsourcing, built for the
work your team would automate if they had the time.

How many developers can your platform team
support before you need to add L0/L1 headcount?
Without Engineering
Assistants (before)
25
With Engineering
Assistants for devs (after)

110
A Tier-1 Telco platform team is providing
Assistants to developers to speed up a
mass migration to GKE
A Fortune 50 ops team is hiring expert
SREs with Assistants to replace $800m
of outsourced L0/L1 support
Outsourced
L0/L1
support
(before)

Expert SREs
with Assistants
(after)
First Project: $600k/yr cost
savings, 78% reduction in MTTR
and a high-end team
In Dev/Test In Production

Execs have executive assistants, engineers should have
engineering assistants… what could one do for you?
DEV
-Troubleshoot dependencies outside of my code
-Collect relevant logs and state outside of my code when my tests fail
-Sanity check my manifests
-Enrich tickets assigned to me with logs, env vars, etc.
-Bump CPU/Mem/Storage/Replicas when I need it

PLATFORM
-Collect info for devs’ (repetitive) troubleshooting
-Triage noisy alerts in the test env
-Help with basic resource right-sizing
-Sanity check manifests
-Troubleshoot and re-run flaky CI/CD jobs

PROD
-Triage test and production alerts
-Collect info and route an escalation to the right person
-Run broad health checks across many components for root cause analysis
-Collect reliability analytics
-Repetitive remediations (restarts, expand storage, right-size replicas…)

Intuition: delegate tasks that can be done on the CLI

Goal driven automation is 100x faster
Instead of writing code to automate a workflow, engineers
sync an environment with experts’ libraries and give their
Engineering Assistant a “goal.”
The Assistant runs automated steps from the libraries to
reach the goal, escalating if it can’t take the next step.
There is so little code/configuration required that most
teams have their first Assistant running in under an hour.
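
The goal-driven idea above can be illustrated with a minimal sketch. This is not RunWhen's implementation: the Task and Outcome structures, the keyword matching, and the function name pursue_goal are all hypothetical, chosen only to show "pick steps from a library, run them, escalate when a step cannot be taken."

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """One automated step from a shared library (hypothetical structure)."""
    name: str
    keywords: frozenset          # words a goal must mention for this task to apply
    run: Callable[[], bool]      # returns True if the step completed


@dataclass
class Outcome:
    completed: List[str]
    escalated: bool
    reason: str = ""


def pursue_goal(goal: str, library: List[Task]) -> Outcome:
    """Run every library task relevant to the goal; escalate on the first failure."""
    words = set(goal.lower().split())
    relevant = [t for t in library if t.keywords & words]
    if not relevant:
        return Outcome([], True, "no task in the library matches this goal")
    completed: List[str] = []
    for task in relevant:
        if not task.run():
            return Outcome(completed, True, f"could not complete '{task.name}'")
        completed.append(task.name)
    return Outcome(completed, False)


# Illustrative library; the lambdas stand in for real diagnostics.
library = [
    Task("check pod status", frozenset({"cart", "pods"}), lambda: True),
    Task("search logs for errors", frozenset({"cart", "logs"}), lambda: True),
    Task("expand storage", frozenset({"storage"}), lambda: False),
]
result = pursue_goal("diagnose cart 500s", library)
```

Here the engineer supplies only the goal string; the per-step logic lives in the shared library, which is the point the slide makes about how little code a new Assistant needs.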

K8s AI Architecture Plan
10:00-10:50, LC-52a, 12 attendees accepted
Your Engineering
Assistant did this…
…so you can do that
Edgar

Your Engineering
Assistant did this…
…your execs
get that
Edgar

Your Engineering
Assistant did this…
Edgar
$
…the author
received that*
* Applicable for public automation libraries

Execute safe restart of Kubernetes
deployment cart
Check if cart-svc memory was
above 80% in the last 5 mins
Search cart-api logs for
java error messages
Production-Pali monitors cloud infra error
budgets for ecom-staging and can
reboot out-of-memory VMs or add storage
capacity before users are impacted
Test-Tania responds to #oncall-tst
non-prod alerts and runs hourly health
checks across the entire test environment
Eager-Edgar helps developers by running
diagnostics for their kubernetes
deployments in dev-dk8s-w1
Sync a cluster/cloud with RunWhen libraries to build
your first Engineering Assistant in one hour
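
Two of the example tasks on this slide ("Check if cart-svc memory was above 80% in the last 5 mins" and "Search cart-api logs for java error messages") are small enough to sketch. The version below is an assumption-laden illustration, not a RunWhen library task: the sample format (unix timestamp, memory fraction), the function names, and the regular expression are all hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def memory_exceeded(samples: Iterable[Tuple[float, float]], now: float,
                    threshold: float = 0.80, window_s: float = 300.0) -> bool:
    """samples are (unix_timestamp, memory_fraction) pairs.

    Returns True if any sample inside the last window_s seconds is above threshold.
    """
    return any(now - window_s <= ts <= now and used > threshold
               for ts, used in samples)


# Matches an ERROR log level, or a class name ending in Exception or Error
# (for example java.lang.NullPointerException or OutOfMemoryError).
JAVA_ERROR = re.compile(r"\bERROR\b|\b[\w.$]+(?:Exception|Error)\b")


def find_java_errors(log_lines: Iterable[str]) -> List[str]:
    """Return only the log lines that look like Java errors."""
    return [line for line in log_lines if JAVA_ERROR.search(line)]
```

In practice the samples would come from a metrics backend and the lines from pod logs; both are passed in as plain data here so the checks stay easy to test.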

A roadmap to get your team the time to build high-impact
automation across environments
Saves enough time
to build the next…
…and the next…
…and the next

This team increased feature velocity by
15% while sharing L0/L1 support thanks
to unified automation across dev,
platform and ops engineers
Collaborating Across Dev, Test And Production
-Dev → Prod: Automation to
diagnose failed tests was
re-used to triage alerts in
production
-Prod → Dev: Automation to
collect status during incidents
was re-used to maintain the
shared test environment
Story points per sprint dropped as dev
team started sharing L0/L1 support…
… and came back to a new high
thanks to increased automation

My new code
worked in the
first few tries
I ran into
issues inside
of my code
I ran into
issues outside
of my code
Writing new code (IDE)
Troubleshooting (CLI)
Developer time
spent per pull
request
* RunWhen dev team study, See also IEEE / Microsoft “Today Was A Good Day”
8x
1.2x
Engineering Assistants for developer productivity
(business case templates and benchmarks available on request)
-Dev/Test Engineering Assistants
mitigate these scenarios where
most productivity is lost
-Increase dev velocity AND reduce
escalations to platform/ops teams
-Automation started here is
re-usable for production L0/L1
support

-The developer didn’t end up
blocked for hours on an issue
outside of their code
-The developer didn’t escalate to
the platform/devops team for help
-The developer didn’t make a fuss
about docs that weren’t updated
(which they wouldn’t have found
anyways)

Engineering Assistants for test/staging environments
(business case templates and benchmarks available on request)
“We saved so much
on our non-prod
observability costs
that it paid for the
entire RunWhen
deployment…”
-VP Engineering, AI Startup
-Broad range of automated health
measurement, detailed diagnostics,
root cause and remediation for
Kubernetes apps out-of-the-box
-Executive reporting for Service
Health and Operational Readiness

Cost Savings: Saved $12k/month of non-production logging costs
by capturing logs directly from pods on alerts/tickets/requests
Velocity: Assistant copied logs, env vars and service status to a ticket
before restarting services, reducing Dev <> QA friction
Collaboration: Dev and QA teams contributed automated health
checks specific to the application that flowed through to production
for use by SREs
Test Tania Joined
This Assistant handles alerts and failed test webhooks
in the test environment, and can do basic remediation

* RunWhen survey, n=127, 2023
What are your top “keep the lights on” tasks that your
team would automate if you had the time?
Helping devs with repetitive
troubleshooting (over slack)
60%
Triaging noisy alerts in
test environments

30%
Fixing basic errors in
devs’ manifests

10%
This Assistant listens in slack to
direct engineers to the right tasks
to run in the test environment

SRE time doing initial triage using
dashboards
SRE time spent in automate-able root
cause analysis on the CLI
SRE time spent in remediation (various)
52%
SRE time spent
on prod alerts
and tickets
* Joint customer study based on 2000 production alerts/tickets, validated with 8+ interviews
Engineering Assistants for production reliability
(business case templates and benchmarks available on request)
-Collect and summarize hundreds
of health checks before an expert
is hands-on-keyboard
-Reduce leakage of production
credentials while giving broader
access to “safe” automation

Let us help build your first three
Engineering Assistants as a PoC.
We typically get a first Engineering Assistant
running in your environment in an hour so your
platform team can check it out*.
In a workshop a few weeks later, we build the next
two Assistants for your team to give to developers,
SREs, QA, etc.
* The default Assistant uses tasks that only access the Kubernetes API server, typically with ClusterView or single namespace
read-only permissions. Subsequent Assistants use a broader set of tasks, integrating with more tools in your environment.

THANK
YOU