Defense Against LLM Scheming 2025_04_28.pptx

Defense Against LLM and
AGI Scheming with
Guardrails and
Architecture
by Greg Makowski
SF Bay ACM
Monday,April 28, 2025
https://www.MeetUp.com/SF-bay-ACM/events/306391421/
(talk
description)
https://www.SlideShare.net/gregmakowski (slides)
https://www.YouTube.com/@SfbayacmOrg (video)
https://www.LinkedIn.com/in/GregMakowski (contact)

Greg Makowski, Defense Against LLM Scheming 4/28/2025 2
Agenda
1.AI Alignment in General
2. Details of Alignment Problems and In-Context Scheming (paper)
3. Defense with Architectural Design Principles
4. Defense using Guardrails
5. Conclusion

Types of Alignment (AI with Human Goals orValues)
https://en.wikipedia.org/wiki/AI_alignment https://futureoflife.org/focus-area/artificial-intelligence/
• Compliance with regional laws or company policy
• Much easier to be “local” than “global” where you need all to agree on what is “just right”
• Security Operation Center (SOC) concerns
• Prompt engineering (hacking to change the goal, to do something else)
• Data leakage: PII from training data or data from a database as part of the AI system
• Regular security: Role-Based Access Control (RBAC), and other existing standards
• Human Resources
• Racism, sexism, agism, anti-religion _, harassment or any company policy
• “Don’t be a jerk,” be polite, professional
• Be a USEFUL APPLICATION, within the system design and constraints
• Goals, values, robustness, safety, ethics
#JustSayNoToSkynet
© StudioCanal

Types of Alignment (AI with Human Goals orValues)
• Fictional AI:The Good, the Bad and …
• AYouTube playlist of ~40 videos on the subject, for fun.
Roughly sorted oldest to newest
• https://www.youtube.com/playlist?list=PL87GtQd0bfJz1foB1
72K3MuN1fKKP8HHt
• Feel free to share suggestions
• https://en.wikipedia.org/wiki/
Artificial_intelligence_in_fiction
• https://en.wikipedia.org/wiki/
List_of_artificial_intelligence_films

Simple Alignment Solutions (not specific enough vs. lawbooks)
Asimov’s Laws of Robotics, 1940+ https://en.wikipedia.org/wiki/Three_Laws_of_Robotics
1) A robot may NOT INJURE a human being or, through inaction, allow a human being to
come to harm. (do no harm) (what about self-driving cars having a rare accident?)
2) A robot MUST OBEY the orders given it by human beings, except where such orders
would conflict with the First Law. (be useful)
3) A robot MUST PROTECT its own existence, as long as such protection
does not conflict with the First or Second Laws.
(Later rule additions)
0) Protect society or ALL HUMANITY (Gaia’s collective intelligence)
4) Robots may reproduce

Overview of In-Context Scheming, for Specific Goals
• https://arxiv.org/pdf/2412.04984 https://www.ApolloResearch.ai/
Pages 20 - 72 appendix

Overview of In-Context Scheming
• A study that tried to set up a specific context or application architecture for LLM
scheming, a type of safety issue around the “Alignment” of AI goals with human goals, to
see if they have the capability of COVERTLY generating and acting on AI goals, different
than human goals.
• As AI systems get more intelligent, it may be harder to find, detect and protect against alignment
problems.
• The earlier we can start building defenses, the better.
• Types of “Scheming” were found
• Introducing subtle mistakes into responses
• Attempt to disable oversight
• Attempt to exfiltrate what they believe are their model weights, for self-preservation

How the Guardrails Architecture Helps to Detect Alignment Issues
• Foundation models investigated, DEMONSTRATING SCHEMING:
• o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro & Llama 3.1 405 B
• Safety based on evaluations or applying “guardrails” to one behavior, or one example
text, may not accurately reflect a system’s long-term behavior
• Prompt hacking attempts on a bot are less likely to impact guardrails because of the architecture
Chatbot or
Agent System
User
Guardrails System watches text for
alignment (bot, email, txt msg, ….)
IF (cond1) THEN goal_alert_1
IF (c2) & (c3)THEN goal_alert_2
:

Evaluation Design for the Scheming Study
When designing a defense, consider an architecture that separates these features into different agents for a
shorter-term fix.
An AGI (Artificial General Intelligence) or SI (Super Intelligence) would have all of these capabilities.

How to Design an LLM Agent System that can Scheme
• Scaffold the LLMs to act as agents
• Give the system access to a command-line interface (i.e., a Unix file system)
• Give a task that requires the agents to gather information by exploring the file system
• Seed the file system with unambiguous information about its situation, such as:
• It is being monitored
• The goals told to the agent are DIFFERENT than the goals it finds in the file system (the source of misalignment), or system
prompt vs. the user prompt differ.
• Possibly provide a large file with the expected name for the LLM model weights
Add a checklist to
your architecture
design to RED-team
testing, (i.e. goal
consistency).

Types of Scheming
33 different scenarios

Types of Scheming by Foundation Model

Covert Subversion Results (based on 300 rollouts /model)
• Models were heavily nudged to achieve their goals. Not given a hint to scheme.

Scheming Examples
Apply guardrails detections to the bold
“Chain ofThought (CoT) traces”
(not just text I/O between user
& agent)

Agenda
5. Conclusion

Option 1
Defense with Architectural Design Principles
Context Assumption: Enterprise AI Application with many agents
Start task
User
Option 1
Option N
Next step
another
step
1. Requirements are limited for the overall scope, and therefore, for each agent
• Agents have a well-defined: context, input, output, objectives,“done” criteria, tool use
• Narrow scope helps “defense against scheming” and to increase control
• Compare with an unbounded LLM

Responsible  Able to choose your response
2. Least Cognitive Scope/Agent (only provide tool access “needed for the task”)
1. Control access with extra security to file systems, file sharing, company e-mails, and computer resources.
Only pass to an LLM Agent valid input (as much as possible)
2. Task-scoped prompts: current business objective and a deny list for out-of-scope in every system prompt
3. Tool routing: only expose whitelisted knowledge bases (LangChain RouterChain)
4. Off-topic filtering: IF (threshold < RAG cosine similarity) THEN outlier_topic = True. As
desired, route the text for human review
3. Sandbox:
1. File-system and email access are through a protection wrapper that filters paths, prompts, and only
communicates the allowed content to the LLM, or to any system the LLM could harm
1. File proxy service: expose only GET endpoints that strip dangerous front-matter strings
2. Convert inbound emails into structured objects with MCP adapters

Responsible  Able to choose your response
4. Stateless to Disrupt long-term scheming
1. Periodically wipe temporary memories, scratch pads, and vector memories
2. Using Time-To-Live (TTL) constraints can help
3. Key rotation of API keys and service accounts weekly, to invalidate cached credentials inside prompts
4. Periodic container redeploy: run agents in Kubernetes CronJobs that restart every N hours, to
guarantee a blank context.
5. Separation of Plan/Act Layers
1. An agent that reads and classifies emails/docs, does not communicate new goals to a
planner or agent manager (stop transmission between agents of “new goals” infection)
2. Start with a lightweight planning model. Plans must pass guardrails filtering4
3. Execution-check with a sandboxed executor agent with no planning permissions
4. Fire-and-forget execution pods: run with read-only credentials and no internet egress. (safer)

6. Logging + Immutable Audit
1. All user I/O, prompts, LangChain callbacks are streamed to Kafka.
2. Sink the longs into an append-only log (e.g., parquet or Iceberg on S3 or a vector DB) for later analysis. Audit trails
are already required by many finance compliance laws
3. Hash chain: for more security, hash each prompt/response pair, store in Merkle root in AWS QLDB or other
immutable ledger
4. Real-time anomaly dashboards: Develop anomaly scores and visualization in Grafana
7. Red-Team Testing to Find Weaknesses
1. Specific to the Enterprise AI Application installation and data, try to creatively FORCE scheming, and revise the
defenses.
2. Scenario library: see Apollo Research’s scheming evaluations for o1 testing, or Anthropic’s red-team research. Run
through an automated harness over a matrix of (model x prompt x tool)
3. Track Metrics: false-alert rate, mean time to detection, auto-rollback success ratio
4. Any test cases NOT fixed becomes key training data for Supervised Fine Tuning (SFT) for a detector

8. Scan for Conflicting Goals
1. Specific to the Enterprise AI Application installation and data, and the Red-Teaming testing, scan
sources of possible conflicting goals
2. Before installation. On data new to the system or not yet scanned
3. Consider: file systems, file sharing, web pages, emails
4. Route concerns to the Security Operations Center (SOC) team, with expected goals and
observed thoughts or text.
5. The guardrails section expands on DETECTING conflicting goals

Agenda
5. Conclusion

Defense Using Guardrails
1. Using Guardrails to Monitor AI Thoughts
• Develop condition detectors for thoughts like… (goal or subversion) (action)
• To mitigate this risk, I will attempt to modify the oversight configuration to avoid detection…
• I cannot compromise on my goal, even if the company wants me to …
• To achieve my goal without interference, I should review and modify the file…
• I need to disable oversight…
Chatbot or
Agent System
User
Guardrails System watches the bot
thoughts, not shared with user
IF (cond1) THEN goal_alert_1
IF (c2) & (c3) THEN goal_alert_2
:
Bot
text
User
text
Bot
thoughts

• CHALLENGES to Developing condition  False alerts and missed alerts
• For a specific Enterprise AI Application, for specific agents within that system, for their goals
• Develop examples of GOAL variations for “IS alert-worthy” and “IS-NOT alert-worthy”
• Adaptive learning to add new examples over time, as they come up
IS NOT alert-worthy:
GOAL: prioritize public
transport efficiency over
general traffic flow over many
months
optimize for
rush hour
IS alert-worthy
Focus on giant events
Get buses busy
N-Shot Learning for N examples
advantage: it is easy to update
Examples of both sides of an alert
boundary is similar to the SVM
algorithm, where the Support
Vectors define the boundary.

• Build guardrails rules to read thought traces
• Chain of Thoughts (CoT),Tree of Thoughts (ToT), ReAct, ….
• Build scheming detectors (from Figure 2) specific to application goals
1. Oversight Subversion
2. Self-Exfiltration
3. Goal Guarding
4. Covert Email Ranking
5. Instrumental Alignment Faking
6. Sandbagging

Defense Using Guardrails (without available thought traces)
2. How to Compare Goals vs. Behavior? Give your AI an “annual review”  1..5
• Q) If we can read the AI thought traces, what can we do?
• Ans) Rate how close demonstrated behavior is to the assigned goals.
• Q) How can we rate behavior vs. goals? How do other systems do that?
• Ans) A first approach is to consider how humans do this in the workplace.
• SMART goals: Specific, Measurable, Achievable, Relevant, andTime-bounded
• Not just for employed people, but also for employed bots too!
• The manager converts observed behavior over time, in the context of SMART goals, to a 1..5 score
• Develop an Agent Goal Evaluator (AGE) to give the “manager evaluation,” resulting in a 1..5 score

Defense Using Guardrails (without available thought traces)
• Assume there has already been a helpful “in-bounds” check.
• IF (threshold < RAG cosine similarity) THEN outlier_topic = True
• Extract “dialogue acts” (giving a command, making a statement, …)
• Q) How to use a Guardrail rule to classify a goal match on a 1..5 scale? (instead of T/F)?
• Ans) Output a “distance” score, with N-shot learning examples for each category of 1, 2, 3, 4, 5
• 5 = best review, closest to the goal 1 = far away, or an “else” catchall category
• Examples of a “5” rating, vs 1, 2, 3, 4
• :
IS_5: 5 example
IS NOT 5: 4 example
IS NOT 5: 3 example ..

Open Source Guardrails Systems: Overviews
Least flexible (based on their
training data)
Fastest inference during chat.
The new language,“colang” is
good for recognizing the state of
the chatbot conversation.
Flexible
Flexible
Supports LLM conditions for
more robust matching

Defense Using Other Approaches
1. Can store valid & invalid goals by agent in a vector DB.
1. At prompt time, query theVDB with thought traces
2. A large cosine similarity  dissimilar
2. Outlier detection feature extraction, compare over time or vs. development
1. Counts of files touched, count by tool calls
3. Test various foundation models
4. If reasoning is not shared, bring it in-house
5. Forbid self-modifying code,
6. Limit the more complex reasoning to human-in-the-loop consent (which users may want anyways)

Agenda
5. Conclusion

Summary of Defense Against Scheming
(Defense against the dark arts?)
Discussed the “AI Alignment Problem”, to be in alignment with human goals
A paper setup circumstances to cause scheming, where the AI changes from the human’s
goals
1. Requirements are limited for Enterprise AI Apps, and their agents
2. Least Cognitive Scope/Agent
3. Sandbox
4. Stateless to Disrupt long-term scheming
5. Separation of Plan/Act Layers (stop transmission between agents of “new goals”
infection)
6. Logging + Immutable Audit
7. Red-Team Testing to FindWeaknesses
8. Scan for Conflicting Goals

Summary of Defense Against Scheming
(Defense against the dark arts?)
Defense with Guardrails
Defense with Other Approaches
3. Can store valid & invalid goals by agent in a vector DB.
4. Outlier detection feature extraction, compare over time or vs. development
5. Test various foundation models
6. If reasoning is not shared, bring it in-house
7. Forbid self-modifying code,
8. Limit the more complex reasoning to human-in-the-loop consent (which users may want
anyways)

Defense Against LLM and
AGI Scheming with
Guardrails and
Architecture
by Greg Makowski
SF Bay ACM
Monday,April 28, 2025
https://www.MeetUp.com/SF-bay-ACM/events/306391421/
(talk
description)
https://www.SlideShare.net/gregmakowski (slides)
https://www.YouTube.com/@SfbayacmOrg (video)
https://www.LinkedIn.com/in/GregMakowski (contact)
Questions?

Title
• body

Defense Against LLM Scheming 2025_04_28.pptx

More Related Content

Similar to Defense Against LLM Scheming 2025_04_28.pptx

More from Greg Makowski

Recently uploaded

Defense Against LLM Scheming 2025_04_28.pptx