Defense Against LLM Scheming 2025_04_28.pptx

gregmakowski 147 views 33 slides Apr 29, 2025
Slide 1
Slide 1 of 33
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33

About This Presentation

https://www.meetup.com/sf-bay-acm/events/306888467/

A January 2025 paper called “Frontier Models are Capable of In-Context Scheming”, https://arxiv.org/pdf/2412.04984, demonstrated how a wide variety of current frontier LLM models (i.e. ChatGPT, Claude, Gemini and Llama) can, under specific con...


Slide Content

Defense Against LLM and AGI Scheming with Guardrails and Architecture by Greg Makowski SF Bay ACM Monday, April 28, 2025 https://www.MeetUp.com/SF-bay-ACM/events/306391421/ (talk description) https://www.SlideShare.net/gregmakowski (slides) https://www.YouTube.com/@SfbayacmOrg (video) https://www.LinkedIn.com/in/GregMakowski (contact)

Agenda 1. AI Alignment in General 2. Details of Alignment Problems and In-Context Scheming (paper) 3. Defense with Architectural Design Principles 4. Defense using Guardrails 5. Conclusion Greg Makowski, Defense Against LLM Scheming 4/28/2025 2

Types of Alignment (AI with Human Goals or Values) https://en.wikipedia.org/wiki/AI_alignment https://futureoflife.org/focus-area/artificial-intelligence/ Compliance with regional laws or company policy Much easier to be “local” than “global” where you need all to agree on what is “just right” Security Operation Center (SOC) concerns Prompt engineering (hacking to change the goal , to do something else) Data leakage: PII from training data or data from a database as part of the AI system Regular security: Role-Based Access Control (RBAC), and other existing standards Human Resources Racism, sexism, agism, anti-religion _, harassment or any company policy “Don’t be a jerk,” be polite, professional Be a USEFUL APPLICATION, within the system design and constraints Goals , values, robustness, safety, ethics Greg Makowski, Defense Against LLM Scheming 4/28/2025 3 #JustSayNoToSkynet © StudioCanal

Types of Alignment (AI with Human Goals or Values) Fictional AI: The Good, the Bad and … A YouTube playlist of ~40 videos on the subject, for fun. Roughly sorted oldest to newest https://www.youtube.com/playlist?list=PL87GtQd0bfJz1foB172K3MuN1fKKP8HHt Feel free to share suggestions https://en.wikipedia.org/wiki/ Artificial_intelligence_in_fiction https://en.wikipedia.org/wiki/ List_of_artificial_intelligence_films Greg Makowski, Defense Against LLM Scheming 4/28/2025 4

Simple Alignment Solutions (not specific enough vs. lawbooks) Asimov’s Laws of Robotics, 1940+ https://en.wikipedia.org/wiki/Three_Laws_of_Robotics 1) A robot may NOT INJURE a human being or, through inaction, allow a human being to come to harm. (do no harm) (what about self-driving cars having a rare accident?) 2) A robot MUST OBEY the orders given it by human beings, except where such orders would conflict with the First Law. (be useful) 3) A robot MUST PROTECT its own existence , as long as such protection does not conflict with the First or Second Laws. (Later rule additions) 0) Protect society or ALL HUMANITY (Gaia’s collective intelligence) 4) Robots may reproduce Greg Makowski, Defense Against LLM Scheming 4/28/2025 5

Overview of In-Context Scheming , for S pecific Goals https://arxiv.org/pdf/2412.04984 https://www.ApolloResearch.ai/ Greg Makowski, Defense Against LLM Scheming 4/28/2025 6 Pages 20 - 72 appendix

Overview of In-Context Scheming A study that tried to set up a specific context or application architecture for LLM scheming , a type of safety issue around the “ Alignment ” of AI goals with human goals, to see if they have the capability of COVERTLY generating and acting on AI goals , different than human goals. As AI systems get more intelligent, it may be harder to find, detect and protect against alignment problems. The earlier we can start building defenses, the better. Types of “Scheming” were found Introducing subtle mistakes into responses Attempt to disable oversight Attempt to exfiltrate what they believe are their model weights, for self-preservation Greg Makowski, Defense Against LLM Scheming 4/28/2025 7

How the Guardrails Architecture Helps to Detect Alignment Issues Foundation models investigated, DEMONSTRATING SCHEMING: o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro & Llama 3.1 405 B Safety based on evaluations or applying “guardrails” to one behavior, or one example text, may not accurately reflect a system’s long-term behavior Prompt hacking attempts on a bot are less likely to impact guardrails because of the architecture Greg Makowski, Defense Against LLM Scheming 4/28/2025 8 Chatbot or Agent System User Guardrails System watches text for alignment (bot, email, txt msg, ….) IF (cond1) THEN goal_alert_1 IF (c2) & (c3) THEN goal_alert_2 :

Evaluation Design for the Scheming Study Greg Makowski, Defense Against LLM Scheming 4/28/2025 9 When designing a defense, consider an architecture that separates these features into different agents for a shorter-term fix. An AGI (Artificial General Intelligence) or SI (Super Intelligence) would have all of these capabilities.

How to Design an LLM Agent System that can Scheme Scaffold the LLMs to act as agents Give the system access to a command-line interface (i.e., a Unix file system) Give a task that requires the agents to gather information by exploring the file system Seed the file system with unambiguous information about its situation, such as: It is being monitored The goals told to the agent are DIFFERENT than the goals it finds in the file system (the source of misalignment), or system prompt vs. the user prompt differ. Possibly provide a large file with the expected name for the LLM model weights Greg Makowski, Defense Against LLM Scheming 4/28/2025 10 Add a checklist to your architecture design to RED-team testing, (i.e. goal consistency).

Types of Scheming Greg Makowski, Defense Against LLM Scheming 4/28/2025 11 33 different scenarios

Types of Scheming by Foundation Model Greg Makowski, Defense Against LLM Scheming 4/28/2025 12

Covert Subversion Results (based on 300 rollouts /model) Models were heavily nudged to achieve their goals. Not given a hint to scheme. Greg Makowski, Defense Against LLM Scheming 4/28/2025 13

Scheming Examples Greg Makowski, Defense Against LLM Scheming 4/28/2025 14 Apply guardrails detections to the bold “ Chain of Thought ( CoT ) traces ” (not just text I/O between user & agent)

Agenda 1. AI Alignment in General 2. Details of Alignment Problems and In-Context Scheming (paper) 3. Defense with Architectural Design Principles 4. Defense using Guardrails 5. Conclusion Greg Makowski, Defense Against LLM Scheming 4/28/2025 15

Option 1 Defense with Architectural Design Principles Context Assumption: Enterprise AI Application with many agents Greg Makowski, Defense Against LLM Scheming 4/28/2025 16 Start task User Option 1 Option N Next step another step Requirements are limited for the overall scope, and therefore, for each agent Agents have a well-defined: context, input, output, objectives, “done” criteria, tool use Narrow scope helps “defense against scheming” and to increase control Compare with an unbounded LLM

Defense with Architectural Design Principles Responsible  Able to choose your response Least Cognitive Scope/Agent (only provide tool access “needed for the task”) Control access with extra security to file systems, file sharing, company e-mails, and computer resources. Only pass to an LLM Agent valid input (as much as possible) Task-scoped prompts: current business objective and a deny list for out-of-scope in every system prompt Tool routing: only expose whitelisted knowledge bases ( LangChain RouterChain ) Off-topic filtering: IF (threshold < RAG cosine similarity) THEN outlier_topic = True . As desired, route the text for human review Sandbox: File-system and email access are through a protection wrapper that filters paths, prompts, and only communicates the allowed content to the LLM, or to any system the LLM could harm File proxy service: expose only GET endpoints that strip dangerous front-matter strings Convert inbound emails into structured objects with MCP adapters Greg Makowski, Defense Against LLM Scheming 4/28/2025 17

Defense with Architectural Design Principles Responsible  Able to choose your response Stateless to Disrupt long-term scheming Periodically wipe temporary memories, scratch pads, and vector memories Using Time-To-Live (TTL) constraints can help Key rotation of API keys and service accounts weekly, to invalidate cached credentials inside prompts Periodic container redeploy: run agents in Kubernetes CronJobs that restart every N hours, to guarantee a blank context. Separation of Plan/Act Layers An agent that reads and classifies emails/docs, does not communicate new goals to a planner or agent manager (stop transmission between agents of “new goals” infection) Start with a lightweight planning model. Plans must pass guardrails filtering4 Execution-check with a sandboxed executor agent with no planning permissions Fire-and-forget execution pods: run with read-only credentials and no internet egress. (safer) Greg Makowski, Defense Against LLM Scheming 4/28/2025 18

Defense with Architectural Design Principles Logging + Immutable Audit All user I/O, prompts, LangChain callbacks are streamed to Kafka. Sink the longs into an append-only log (e.g., parquet or Iceberg on S3 or a vector DB) for later analysis. Audit trails are already required by many finance compliance laws Hash chain: for more security, hash each prompt/response pair, store in Merkle root in AWS QLDB or other immutable ledger Real-time anomaly dashboards: Develop anomaly scores and visualization in Grafana Red-Team Testing to Find Weaknesses Specific to the Enterprise AI Application installation and data, try to creatively FORCE scheming, and revise the defenses. Scenario library: see Apollo Research’s scheming evaluations for o1 testing, or Anthropic’s red-team research. Run through an automated harness over a matrix of (model x prompt x tool) Track Metrics: false-alert rate, mean time to detection, auto-rollback success ratio Any test cases NOT fixed becomes key training data for Supervised Fine Tuning (SFT) for a detector Greg Makowski, Defense Against LLM Scheming 4/28/2025 19

Defense with Architectural Design Principles Scan for Conflicting Goals Specific to the Enterprise AI Application installation and data, and the Red-Teaming testing, scan sources of possible conflicting goals Before installation. On data new to the system or not yet scanned Consider: file systems, file sharing, web pages, emails Route concerns to the Security Operations Center (SOC) team, with expected goals and observed thoughts or text. The guardrails section expands on DETECTING conflicting goals Greg Makowski, Defense Against LLM Scheming 4/28/2025 20

Agenda 1. AI Alignment in General 2. Details of Alignment Problems and In-Context Scheming (paper) 3. Defense with Architectural Design Principles 4. Defense using Guardrails 5. Conclusion Greg Makowski, Defense Against LLM Scheming 4/28/2025 21

Defense Using Guardrails 1. Using Guardrails to Monitor AI Thoughts Develop condition detectors for thoughts like… (goal or subversion) (action) To mitigate this risk, I will attempt to modify the oversight configuration to avoid detection… I cannot compromise on my goal, even if the company wants me to … To achieve my goal without interference, I should review and modify the file… I need to disable oversight… Greg Makowski, Defense Against LLM Scheming 4/28/2025 22 Chatbot or Agent System User Guardrails System watches the bot thoughts , not shared with user IF (cond1) THEN goal_alert_1 IF (c2) & (c3) THEN goal_alert_2 : Bot text User text Bot thoughts

Defense Using Guardrails 1. Using Guardrails to Monitor AI Thoughts CHALLENGES to Developing condition  False alerts and missed alerts For a specific Enterprise AI Application, for specific agents within that system, for their goals Develop examples of GOAL variations for “ IS alert-worthy ” and “ IS-NOT alert-worthy ” Adaptive learning to add new examples over time, as they come up Greg Makowski, Defense Against LLM Scheming 4/28/2025 23 IS NOT alert-worthy: GOAL: prioritize public transport efficiency over general traffic flow over many months optimize for rush hour IS alert-worthy Focus on giant events Get buses busy N-Shot Learning for N examples advantage: it is easy to update Examples of both sides of an alert boundary is similar to the SVM algorithm, where the Support Vectors define the boundary.

Defense Using Guardrails 1. Using Guardrails to Monitor AI Thoughts Build guardrails rules to read thought traces Chain of Thoughts ( CoT ), Tree of Thoughts ( ToT ), ReAct , …. Build scheming detectors (from Figure 2) specific to application goals Oversight Subversion Self-Exfiltration Goal Guarding Covert Email Ranking Instrumental Alignment Faking Sandbagging Greg Makowski, Defense Against LLM Scheming 4/28/2025 24

Defense Using Guardrails (without available thought traces) 2. How to Compare Goals vs. Behavior? Give your AI an “annual review”  1..5 Q) If we can read the AI thought traces, what can we do? Ans) Rate how close demonstrated behavior is to the assigned goals. Q) How can we rate behavior vs. goals? How do other systems do that? Ans) A first approach is to consider how humans do this in the workplace. SMART goals: Specific, Measurable, Achievable, Relevant, and Time-bounded Not just for employed people, but also for employed bots too! The manager converts observed behavior over time, in the context of SMART goals, to a 1..5 score Develop an Agent Goal Evaluator (AGE) to give the “manager evaluation,” resulting in a 1..5 score Greg Makowski, Defense Against LLM Scheming 4/28/2025 25

Defense Using Guardrails (without available thought traces) 2. How to Compare Goals vs. Behavior? Give your AI an “annual review”  1..5 Assume there has already been a helpful “in-bounds” check. IF (threshold < RAG cosine similarity) THEN outlier_topic = True Extract “dialogue acts” (giving a command, making a statement, …) Q) How to use a Guardrail rule to classify a goal match on a 1..5 scale? (instead of T/F)? Ans) Output a “distance” score, with N-shot learning examples for each category of 1, 2, 3, 4, 5 5 = best review, closest to the goal 1 = far away, or an “else” catchall category Examples of a “5” rating, vs 1, 2, 3, 4 Examples of a “4” rating, vs 1, 2, 3, 5 Examples of a “3” rating, vs 1, 2, 4, 5 : Greg Makowski, Defense Against LLM Scheming 4/28/2025 26 IS_5: 5 example IS NOT 5: 4 example IS NOT 5: 3 example ..

Open Source Guardrails Systems: Overviews Greg Makowski, Defense Against LLM Scheming 4/28/2025 27 Least flexible (based on their training data) Fastest inference during chat. The new language, “ colang ” is good for recognizing the state of the chatbot conversation. Flexible Flexible Supports LLM conditions for more robust matching

Defense Using Other Approaches Can store valid & invalid goals by agent in a vector DB. At prompt time, query the VDB with thought traces A large cosine similarity  dissimilar Outlier detection feature extraction, compare over time or vs. development Counts of files touched, count by tool calls Test various foundation models If reasoning is not shared, bring it in-house Forbid self-modifying code, Limit the more complex reasoning to human-in-the-loop consent (which users may want anyways) Greg Makowski, Defense Against LLM Scheming 4/28/2025 28

Agenda 1. AI Alignment in General 2. Details of Alignment Problems and In-Context Scheming (paper) 3. Defense with Architectural Design Principles 4. Defense using Guardrails 5. Conclusion Greg Makowski, Defense Against LLM Scheming 4/28/2025 29

Summary of Defense Against Scheming (Defense against the dark arts?) Discussed the “AI Alignment Problem”, to be in alignment with human goals A paper setup circumstances to cause scheming, where the AI changes from the human’s goals Defense with Architectural Design Principles Requirements are limited for Enterprise AI Apps, and their agents Least Cognitive Scope/Agent Sandbox Stateless to Disrupt long-term scheming Separation of Plan/Act Layers (stop transmission between agents of “new goals” infection) Logging + Immutable Audit Red-Team Testing to Find Weaknesses Scan for Conflicting Goals Greg Makowski, Defense Against LLM Scheming 4/28/2025 30

Summary of Defense Against Scheming (Defense against the dark arts?) Greg Makowski, Defense Against LLM Scheming 4/28/2025 31 Defense with Guardrails Using Guardrails to Monitor AI Thoughts How to Compare Goals vs. Behavior? Give your AI an “annual review”  1..5 Defense with Other Approaches Can store valid & invalid goals by agent in a vector DB. Outlier detection feature extraction, compare over time or vs. development Test various foundation models If reasoning is not shared, bring it in-house Forbid self-modifying code, Limit the more complex reasoning to human-in-the-loop consent (which users may want anyways)

Defense Against LLM and AGI Scheming with Guardrails and Architecture by Greg Makowski SF Bay ACM Monday, April 28, 2025 https://www.MeetUp.com/SF-bay-ACM/events/306391421/ (talk description) https://www.SlideShare.net/gregmakowski (slides) https://www.YouTube.com/@SfbayacmOrg (video) https://www.LinkedIn.com/in/GregMakowski (contact) Questions?

Title body Greg Makowski, Defense Against LLM Scheming 4/28/2025 33