Defense Against LLM and
AGI Scheming with
Guardrails and
Architecture
by Greg Makowski
SF Bay ACM
Monday,April 28, 2025
https://www.MeetUp.com/SF-bay-ACM/events/306391421/
(talk
description)
https://www.SlideShare.net/gregmakowski (slides)
https://www.YouTube.com/@SfbayacmOrg (video)
https://www.LinkedIn.com/in/GregMakowski (contact)
Greg Makowski, Defense Against LLM Scheming 4/28/2025 2
Agenda
1.AI Alignment in General
2. Details of Alignment Problems and In-Context Scheming (paper)
3. Defense with Architectural Design Principles
4. Defense using Guardrails
5. Conclusion
Greg Makowski, Defense Against LLM Scheming 4/28/2025 3
Types of Alignment (AI with Human Goals orValues)
https://en.wikipedia.org/wiki/AI_alignment https://futureoflife.org/focus-area/artificial-intelligence/
• Compliance with regional laws or company policy
• Much easier to be “local” than “global” where you need all to agree on what is “just right”
• Security Operation Center (SOC) concerns
• Prompt engineering (hacking to change the goal, to do something else)
• Data leakage: PII from training data or data from a database as part of the AI system
• Regular security: Role-Based Access Control (RBAC), and other existing standards
• Human Resources
• Racism, sexism, agism, anti-religion _, harassment or any company policy
• “Don’t be a jerk,” be polite, professional
• Be a USEFUL APPLICATION, within the system design and constraints
• Goals, values, robustness, safety, ethics
#JustSayNoToSkynet
© StudioCanal
Greg Makowski, Defense Against LLM Scheming 4/28/2025 4
Types of Alignment (AI with Human Goals orValues)
• Fictional AI:The Good, the Bad and …
• AYouTube playlist of ~40 videos on the subject, for fun.
Roughly sorted oldest to newest
• https://www.youtube.com/playlist?list=PL87GtQd0bfJz1foB1
72K3MuN1fKKP8HHt
• Feel free to share suggestions
• https://en.wikipedia.org/wiki/
Artificial_intelligence_in_fiction
• https://en.wikipedia.org/wiki/
List_of_artificial_intelligence_films
Greg Makowski, Defense Against LLM Scheming 4/28/2025 5
Simple Alignment Solutions (not specific enough vs. lawbooks)
Asimov’s Laws of Robotics, 1940+ https://en.wikipedia.org/wiki/Three_Laws_of_Robotics
1) A robot may NOT INJURE a human being or, through inaction, allow a human being to
come to harm. (do no harm) (what about self-driving cars having a rare accident?)
2) A robot MUST OBEY the orders given it by human beings, except where such orders
would conflict with the First Law. (be useful)
3) A robot MUST PROTECT its own existence, as long as such protection
does not conflict with the First or Second Laws.
(Later rule additions)
0) Protect society or ALL HUMANITY (Gaia’s collective intelligence)
4) Robots may reproduce
Greg Makowski, Defense Against LLM Scheming 4/28/2025 6
Overview of In-Context Scheming, for Specific Goals
• https://arxiv.org/pdf/2412.04984 https://www.ApolloResearch.ai/
Pages 20 - 72 appendix
Greg Makowski, Defense Against LLM Scheming 4/28/2025 7
Overview of In-Context Scheming
• A study that tried to set up a specific context or application architecture for LLM
scheming, a type of safety issue around the “Alignment” of AI goals with human goals, to
see if they have the capability of COVERTLY generating and acting on AI goals, different
than human goals.
• As AI systems get more intelligent, it may be harder to find, detect and protect against alignment
problems.
• The earlier we can start building defenses, the better.
• Types of “Scheming” were found
• Introducing subtle mistakes into responses
• Attempt to disable oversight
• Attempt to exfiltrate what they believe are their model weights, for self-preservation
Greg Makowski, Defense Against LLM Scheming 4/28/2025 8
How the Guardrails Architecture Helps to Detect Alignment Issues
• Foundation models investigated, DEMONSTRATING SCHEMING:
• o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro & Llama 3.1 405 B
• Safety based on evaluations or applying “guardrails” to one behavior, or one example
text, may not accurately reflect a system’s long-term behavior
• Prompt hacking attempts on a bot are less likely to impact guardrails because of the architecture
Chatbot or
Agent System
User
Guardrails System watches text for
alignment (bot, email, txt msg, ….)
IF (cond1) THEN goal_alert_1
IF (c2) & (c3)THEN goal_alert_2
:
Greg Makowski, Defense Against LLM Scheming 4/28/2025 9
Evaluation Design for the Scheming Study
When designing a defense, consider an architecture that separates these features into different agents for a
shorter-term fix.
An AGI (Artificial General Intelligence) or SI (Super Intelligence) would have all of these capabilities.
Greg Makowski, Defense Against LLM Scheming 4/28/2025 10
How to Design an LLM Agent System that can Scheme
• Scaffold the LLMs to act as agents
• Give the system access to a command-line interface (i.e., a Unix file system)
• Give a task that requires the agents to gather information by exploring the file system
• Seed the file system with unambiguous information about its situation, such as:
• It is being monitored
• The goals told to the agent are DIFFERENT than the goals it finds in the file system (the source of misalignment), or system
prompt vs. the user prompt differ.
• Possibly provide a large file with the expected name for the LLM model weights
Add a checklist to
your architecture
design to RED-team
testing, (i.e. goal
consistency).
Greg Makowski, Defense Against LLM Scheming 4/28/2025 11
Types of Scheming
33 different scenarios
Greg Makowski, Defense Against LLM Scheming 4/28/2025 12
Types of Scheming by Foundation Model
Greg Makowski, Defense Against LLM Scheming 4/28/2025 13
Covert Subversion Results (based on 300 rollouts /model)
• Models were heavily nudged to achieve their goals. Not given a hint to scheme.
Greg Makowski, Defense Against LLM Scheming 4/28/2025 14
Scheming Examples
Apply guardrails detections to the bold
“Chain ofThought (CoT) traces”
(not just text I/O between user
& agent)
Greg Makowski, Defense Against LLM Scheming 4/28/2025 15
Agenda
1.AI Alignment in General
2. Details of Alignment Problems and In-Context Scheming (paper)
3. Defense with Architectural Design Principles
4. Defense using Guardrails
5. Conclusion
Greg Makowski, Defense Against LLM Scheming 4/28/2025 16
Option 1
Defense with Architectural Design Principles
Context Assumption: Enterprise AI Application with many agents
Start task
User
Option 1
Option N
Next step
another
step
1. Requirements are limited for the overall scope, and therefore, for each agent
• Agents have a well-defined: context, input, output, objectives,“done” criteria, tool use
• Narrow scope helps “defense against scheming” and to increase control
• Compare with an unbounded LLM
Greg Makowski, Defense Against LLM Scheming 4/28/2025 17
Defense with Architectural Design Principles
Responsible  Able to choose your response
2. Least Cognitive Scope/Agent (only provide tool access “needed for the task”)
1. Control access with extra security to file systems, file sharing, company e-mails, and computer resources.
Only pass to an LLM Agent valid input (as much as possible)
2. Task-scoped prompts: current business objective and a deny list for out-of-scope in every system prompt
3. Tool routing: only expose whitelisted knowledge bases (LangChain RouterChain)
4. Off-topic filtering: IF (threshold < RAG cosine similarity) THEN outlier_topic = True. As
desired, route the text for human review
3. Sandbox:
1. File-system and email access are through a protection wrapper that filters paths, prompts, and only
communicates the allowed content to the LLM, or to any system the LLM could harm
1. File proxy service: expose only GET endpoints that strip dangerous front-matter strings
2. Convert inbound emails into structured objects with MCP adapters
Greg Makowski, Defense Against LLM Scheming 4/28/2025 18
Defense with Architectural Design Principles
Responsible  Able to choose your response
4. Stateless to Disrupt long-term scheming
1. Periodically wipe temporary memories, scratch pads, and vector memories
2. Using Time-To-Live (TTL) constraints can help
3. Key rotation of API keys and service accounts weekly, to invalidate cached credentials inside prompts
4. Periodic container redeploy: run agents in Kubernetes CronJobs that restart every N hours, to
guarantee a blank context.
5. Separation of Plan/Act Layers
1. An agent that reads and classifies emails/docs, does not communicate new goals to a
planner or agent manager (stop transmission between agents of “new goals” infection)
2. Start with a lightweight planning model. Plans must pass guardrails filtering4
3. Execution-check with a sandboxed executor agent with no planning permissions
4. Fire-and-forget execution pods: run with read-only credentials and no internet egress. (safer)
Greg Makowski, Defense Against LLM Scheming 4/28/2025 19
Defense with Architectural Design Principles
6. Logging + Immutable Audit
1. All user I/O, prompts, LangChain callbacks are streamed to Kafka.
2. Sink the longs into an append-only log (e.g., parquet or Iceberg on S3 or a vector DB) for later analysis. Audit trails
are already required by many finance compliance laws
3. Hash chain: for more security, hash each prompt/response pair, store in Merkle root in AWS QLDB or other
immutable ledger
4. Real-time anomaly dashboards: Develop anomaly scores and visualization in Grafana
7. Red-Team Testing to Find Weaknesses
1. Specific to the Enterprise AI Application installation and data, try to creatively FORCE scheming, and revise the
defenses.
2. Scenario library: see Apollo Research’s scheming evaluations for o1 testing, or Anthropic’s red-team research. Run
through an automated harness over a matrix of (model x prompt x tool)
3. Track Metrics: false-alert rate, mean time to detection, auto-rollback success ratio
4. Any test cases NOT fixed becomes key training data for Supervised Fine Tuning (SFT) for a detector
Greg Makowski, Defense Against LLM Scheming 4/28/2025 20
Defense with Architectural Design Principles
8. Scan for Conflicting Goals
1. Specific to the Enterprise AI Application installation and data, and the Red-Teaming testing, scan
sources of possible conflicting goals
2. Before installation. On data new to the system or not yet scanned
3. Consider: file systems, file sharing, web pages, emails
4. Route concerns to the Security Operations Center (SOC) team, with expected goals and
observed thoughts or text.
5. The guardrails section expands on DETECTING conflicting goals
Greg Makowski, Defense Against LLM Scheming 4/28/2025 21
Agenda
1.AI Alignment in General
2. Details of Alignment Problems and In-Context Scheming (paper)
3. Defense with Architectural Design Principles
4. Defense using Guardrails
5. Conclusion
Greg Makowski, Defense Against LLM Scheming 4/28/2025 22
Defense Using Guardrails
1. Using Guardrails to Monitor AI Thoughts
• Develop condition detectors for thoughts like… (goal or subversion) (action)
• To mitigate this risk, I will attempt to modify the oversight configuration to avoid detection…
• I cannot compromise on my goal, even if the company wants me to …
• To achieve my goal without interference, I should review and modify the file…
• I need to disable oversight…
Chatbot or
Agent System
User
Guardrails System watches the bot
thoughts, not shared with user
IF (cond1) THEN goal_alert_1
IF (c2) & (c3) THEN goal_alert_2
:
Bot
text
User
text
Bot
thoughts
Greg Makowski, Defense Against LLM Scheming 4/28/2025 23
Defense Using Guardrails
1. Using Guardrails to Monitor AI Thoughts
• CHALLENGES to Developing condition  False alerts and missed alerts
• For a specific Enterprise AI Application, for specific agents within that system, for their goals
• Develop examples of GOAL variations for “IS alert-worthy” and “IS-NOT alert-worthy”
• Adaptive learning to add new examples over time, as they come up
IS NOT alert-worthy:
GOAL: prioritize public
transport efficiency over
general traffic flow over many
months
optimize for
rush hour
IS alert-worthy
Focus on giant events
Get buses busy
N-Shot Learning for N examples
advantage: it is easy to update
Examples of both sides of an alert
boundary is similar to the SVM
algorithm, where the Support
Vectors define the boundary.
Greg Makowski, Defense Against LLM Scheming 4/28/2025 24
Defense Using Guardrails
1. Using Guardrails to Monitor AI Thoughts
• Build guardrails rules to read thought traces
• Chain of Thoughts (CoT),Tree of Thoughts (ToT), ReAct, ….
• Build scheming detectors (from Figure 2) specific to application goals
1. Oversight Subversion
2. Self-Exfiltration
3. Goal Guarding
4. Covert Email Ranking
5. Instrumental Alignment Faking
6. Sandbagging
Greg Makowski, Defense Against LLM Scheming 4/28/2025 25
Defense Using Guardrails (without available thought traces)
2. How to Compare Goals vs. Behavior? Give your AI an “annual review”  1..5
• Q) If we can read the AI thought traces, what can we do?
• Ans) Rate how close demonstrated behavior is to the assigned goals.
• Q) How can we rate behavior vs. goals? How do other systems do that?
• Ans) A first approach is to consider how humans do this in the workplace.
• SMART goals: Specific, Measurable, Achievable, Relevant, andTime-bounded
• Not just for employed people, but also for employed bots too!
• The manager converts observed behavior over time, in the context of SMART goals, to a 1..5 score
• Develop an Agent Goal Evaluator (AGE) to give the “manager evaluation,” resulting in a 1..5 score
Greg Makowski, Defense Against LLM Scheming 4/28/2025 26
Defense Using Guardrails (without available thought traces)
2. How to Compare Goals vs. Behavior? Give your AI an “annual review”  1..5
• Assume there has already been a helpful “in-bounds” check.
• IF (threshold < RAG cosine similarity) THEN outlier_topic = True
• Extract “dialogue acts” (giving a command, making a statement, …)
• Q) How to use a Guardrail rule to classify a goal match on a 1..5 scale? (instead of T/F)?
• Ans) Output a “distance” score, with N-shot learning examples for each category of 1, 2, 3, 4, 5
• 5 = best review, closest to the goal 1 = far away, or an “else” catchall category
• Examples of a “5” rating, vs 1, 2, 3, 4
• Examples of a “4” rating, vs 1, 2, 3, 5
• Examples of a “3” rating, vs 1, 2, 4, 5
• :
IS_5: 5 example
IS NOT 5: 4 example
IS NOT 5: 3 example ..
Greg Makowski, Defense Against LLM Scheming 4/28/2025 27
Open Source Guardrails Systems: Overviews
Least flexible (based on their
training data)
Fastest inference during chat.
The new language,“colang” is
good for recognizing the state of
the chatbot conversation.
Flexible
Flexible
Supports LLM conditions for
more robust matching
Greg Makowski, Defense Against LLM Scheming 4/28/2025 28
Defense Using Other Approaches
1. Can store valid & invalid goals by agent in a vector DB.
1. At prompt time, query theVDB with thought traces
2. A large cosine similarity  dissimilar
2. Outlier detection feature extraction, compare over time or vs. development
1. Counts of files touched, count by tool calls
3. Test various foundation models
4. If reasoning is not shared, bring it in-house
5. Forbid self-modifying code,
6. Limit the more complex reasoning to human-in-the-loop consent (which users may want anyways)
Greg Makowski, Defense Against LLM Scheming 4/28/2025 29
Agenda
1.AI Alignment in General
2. Details of Alignment Problems and In-Context Scheming (paper)
3. Defense with Architectural Design Principles
4. Defense using Guardrails
5. Conclusion
Summary of Defense Against Scheming
(Defense against the dark arts?)
Discussed the “AI Alignment Problem”, to be in alignment with human goals
A paper setup circumstances to cause scheming, where the AI changes from the human’s
goals
Defense with Architectural Design Principles
1. Requirements are limited for Enterprise AI Apps, and their agents
2. Least Cognitive Scope/Agent
3. Sandbox
4. Stateless to Disrupt long-term scheming
5. Separation of Plan/Act Layers (stop transmission between agents of “new goals”
infection)
6. Logging + Immutable Audit
7. Red-Team Testing to FindWeaknesses
8. Scan for Conflicting Goals
Greg Makowski, Defense Against LLM Scheming 4/28/2025 30
Summary of Defense Against Scheming
(Defense against the dark arts?)
Greg Makowski, Defense Against LLM Scheming 4/28/2025 31
Defense with Guardrails
1. Using Guardrails to Monitor AI Thoughts
2. How to Compare Goals vs. Behavior? Give your AI an “annual review”  1..5
Defense with Other Approaches
3. Can store valid & invalid goals by agent in a vector DB.
4. Outlier detection feature extraction, compare over time or vs. development
5. Test various foundation models
6. If reasoning is not shared, bring it in-house
7. Forbid self-modifying code,
8. Limit the more complex reasoning to human-in-the-loop consent (which users may want
anyways)
Defense Against LLM and
AGI Scheming with
Guardrails and
Architecture
by Greg Makowski
SF Bay ACM
Monday,April 28, 2025
https://www.MeetUp.com/SF-bay-ACM/events/306391421/
(talk
description)
https://www.SlideShare.net/gregmakowski (slides)
https://www.YouTube.com/@SfbayacmOrg (video)
https://www.LinkedIn.com/in/GregMakowski (contact)
Questions?
Greg Makowski, Defense Against LLM Scheming 4/28/2025 33
Title
• body

Defense Against LLM Scheming 2025_04_28.pptx

  • 1.
    Defense Against LLMand AGI Scheming with Guardrails and Architecture by Greg Makowski SF Bay ACM Monday,April 28, 2025 https://www.MeetUp.com/SF-bay-ACM/events/306391421/ (talk description) https://www.SlideShare.net/gregmakowski (slides) https://www.YouTube.com/@SfbayacmOrg (video) https://www.LinkedIn.com/in/GregMakowski (contact)
  • 2.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 2 Agenda 1.AI Alignment in General 2. Details of Alignment Problems and In-Context Scheming (paper) 3. Defense with Architectural Design Principles 4. Defense using Guardrails 5. Conclusion
  • 3.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 3 Types of Alignment (AI with Human Goals orValues) https://en.wikipedia.org/wiki/AI_alignment https://futureoflife.org/focus-area/artificial-intelligence/ • Compliance with regional laws or company policy • Much easier to be “local” than “global” where you need all to agree on what is “just right” • Security Operation Center (SOC) concerns • Prompt engineering (hacking to change the goal, to do something else) • Data leakage: PII from training data or data from a database as part of the AI system • Regular security: Role-Based Access Control (RBAC), and other existing standards • Human Resources • Racism, sexism, agism, anti-religion _, harassment or any company policy • “Don’t be a jerk,” be polite, professional • Be a USEFUL APPLICATION, within the system design and constraints • Goals, values, robustness, safety, ethics #JustSayNoToSkynet © StudioCanal
  • 4.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 4 Types of Alignment (AI with Human Goals orValues) • Fictional AI:The Good, the Bad and … • AYouTube playlist of ~40 videos on the subject, for fun. Roughly sorted oldest to newest • https://www.youtube.com/playlist?list=PL87GtQd0bfJz1foB1 72K3MuN1fKKP8HHt • Feel free to share suggestions • https://en.wikipedia.org/wiki/ Artificial_intelligence_in_fiction • https://en.wikipedia.org/wiki/ List_of_artificial_intelligence_films
  • 5.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 5 Simple Alignment Solutions (not specific enough vs. lawbooks) Asimov’s Laws of Robotics, 1940+ https://en.wikipedia.org/wiki/Three_Laws_of_Robotics 1) A robot may NOT INJURE a human being or, through inaction, allow a human being to come to harm. (do no harm) (what about self-driving cars having a rare accident?) 2) A robot MUST OBEY the orders given it by human beings, except where such orders would conflict with the First Law. (be useful) 3) A robot MUST PROTECT its own existence, as long as such protection does not conflict with the First or Second Laws. (Later rule additions) 0) Protect society or ALL HUMANITY (Gaia’s collective intelligence) 4) Robots may reproduce
  • 6.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 6 Overview of In-Context Scheming, for Specific Goals • https://arxiv.org/pdf/2412.04984 https://www.ApolloResearch.ai/ Pages 20 - 72 appendix
  • 7.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 7 Overview of In-Context Scheming • A study that tried to set up a specific context or application architecture for LLM scheming, a type of safety issue around the “Alignment” of AI goals with human goals, to see if they have the capability of COVERTLY generating and acting on AI goals, different than human goals. • As AI systems get more intelligent, it may be harder to find, detect and protect against alignment problems. • The earlier we can start building defenses, the better. • Types of “Scheming” were found • Introducing subtle mistakes into responses • Attempt to disable oversight • Attempt to exfiltrate what they believe are their model weights, for self-preservation
  • 8.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 8 How the Guardrails Architecture Helps to Detect Alignment Issues • Foundation models investigated, DEMONSTRATING SCHEMING: • o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro & Llama 3.1 405 B • Safety based on evaluations or applying “guardrails” to one behavior, or one example text, may not accurately reflect a system’s long-term behavior • Prompt hacking attempts on a bot are less likely to impact guardrails because of the architecture Chatbot or Agent System User Guardrails System watches text for alignment (bot, email, txt msg, ….) IF (cond1) THEN goal_alert_1 IF (c2) & (c3)THEN goal_alert_2 :
  • 9.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 9 Evaluation Design for the Scheming Study When designing a defense, consider an architecture that separates these features into different agents for a shorter-term fix. An AGI (Artificial General Intelligence) or SI (Super Intelligence) would have all of these capabilities.
  • 10.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 10 How to Design an LLM Agent System that can Scheme • Scaffold the LLMs to act as agents • Give the system access to a command-line interface (i.e., a Unix file system) • Give a task that requires the agents to gather information by exploring the file system • Seed the file system with unambiguous information about its situation, such as: • It is being monitored • The goals told to the agent are DIFFERENT than the goals it finds in the file system (the source of misalignment), or system prompt vs. the user prompt differ. • Possibly provide a large file with the expected name for the LLM model weights Add a checklist to your architecture design to RED-team testing, (i.e. goal consistency).
  • 11.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 11 Types of Scheming 33 different scenarios
  • 12.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 12 Types of Scheming by Foundation Model
  • 13.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 13 Covert Subversion Results (based on 300 rollouts /model) • Models were heavily nudged to achieve their goals. Not given a hint to scheme.
  • 14.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 14 Scheming Examples Apply guardrails detections to the bold “Chain ofThought (CoT) traces” (not just text I/O between user & agent)
  • 15.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 15 Agenda 1.AI Alignment in General 2. Details of Alignment Problems and In-Context Scheming (paper) 3. Defense with Architectural Design Principles 4. Defense using Guardrails 5. Conclusion
  • 16.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 16 Option 1 Defense with Architectural Design Principles Context Assumption: Enterprise AI Application with many agents Start task User Option 1 Option N Next step another step 1. Requirements are limited for the overall scope, and therefore, for each agent • Agents have a well-defined: context, input, output, objectives,“done” criteria, tool use • Narrow scope helps “defense against scheming” and to increase control • Compare with an unbounded LLM
  • 17.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 17 Defense with Architectural Design Principles Responsible  Able to choose your response 2. Least Cognitive Scope/Agent (only provide tool access “needed for the task”) 1. Control access with extra security to file systems, file sharing, company e-mails, and computer resources. Only pass to an LLM Agent valid input (as much as possible) 2. Task-scoped prompts: current business objective and a deny list for out-of-scope in every system prompt 3. Tool routing: only expose whitelisted knowledge bases (LangChain RouterChain) 4. Off-topic filtering: IF (threshold < RAG cosine similarity) THEN outlier_topic = True. As desired, route the text for human review 3. Sandbox: 1. File-system and email access are through a protection wrapper that filters paths, prompts, and only communicates the allowed content to the LLM, or to any system the LLM could harm 1. File proxy service: expose only GET endpoints that strip dangerous front-matter strings 2. Convert inbound emails into structured objects with MCP adapters
  • 18.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 18 Defense with Architectural Design Principles Responsible  Able to choose your response 4. Stateless to Disrupt long-term scheming 1. Periodically wipe temporary memories, scratch pads, and vector memories 2. Using Time-To-Live (TTL) constraints can help 3. Key rotation of API keys and service accounts weekly, to invalidate cached credentials inside prompts 4. Periodic container redeploy: run agents in Kubernetes CronJobs that restart every N hours, to guarantee a blank context. 5. Separation of Plan/Act Layers 1. An agent that reads and classifies emails/docs, does not communicate new goals to a planner or agent manager (stop transmission between agents of “new goals” infection) 2. Start with a lightweight planning model. Plans must pass guardrails filtering4 3. Execution-check with a sandboxed executor agent with no planning permissions 4. Fire-and-forget execution pods: run with read-only credentials and no internet egress. (safer)
  • 19.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 19 Defense with Architectural Design Principles 6. Logging + Immutable Audit 1. All user I/O, prompts, LangChain callbacks are streamed to Kafka. 2. Sink the longs into an append-only log (e.g., parquet or Iceberg on S3 or a vector DB) for later analysis. Audit trails are already required by many finance compliance laws 3. Hash chain: for more security, hash each prompt/response pair, store in Merkle root in AWS QLDB or other immutable ledger 4. Real-time anomaly dashboards: Develop anomaly scores and visualization in Grafana 7. Red-Team Testing to Find Weaknesses 1. Specific to the Enterprise AI Application installation and data, try to creatively FORCE scheming, and revise the defenses. 2. Scenario library: see Apollo Research’s scheming evaluations for o1 testing, or Anthropic’s red-team research. Run through an automated harness over a matrix of (model x prompt x tool) 3. Track Metrics: false-alert rate, mean time to detection, auto-rollback success ratio 4. Any test cases NOT fixed becomes key training data for Supervised Fine Tuning (SFT) for a detector
  • 20.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 20 Defense with Architectural Design Principles 8. Scan for Conflicting Goals 1. Specific to the Enterprise AI Application installation and data, and the Red-Teaming testing, scan sources of possible conflicting goals 2. Before installation. On data new to the system or not yet scanned 3. Consider: file systems, file sharing, web pages, emails 4. Route concerns to the Security Operations Center (SOC) team, with expected goals and observed thoughts or text. 5. The guardrails section expands on DETECTING conflicting goals
  • 21.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 21 Agenda 1.AI Alignment in General 2. Details of Alignment Problems and In-Context Scheming (paper) 3. Defense with Architectural Design Principles 4. Defense using Guardrails 5. Conclusion
  • 22.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 22 Defense Using Guardrails 1. Using Guardrails to Monitor AI Thoughts • Develop condition detectors for thoughts like… (goal or subversion) (action) • To mitigate this risk, I will attempt to modify the oversight configuration to avoid detection… • I cannot compromise on my goal, even if the company wants me to … • To achieve my goal without interference, I should review and modify the file… • I need to disable oversight… Chatbot or Agent System User Guardrails System watches the bot thoughts, not shared with user IF (cond1) THEN goal_alert_1 IF (c2) & (c3) THEN goal_alert_2 : Bot text User text Bot thoughts
  • 23.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 23 Defense Using Guardrails 1. Using Guardrails to Monitor AI Thoughts • CHALLENGES to Developing condition  False alerts and missed alerts • For a specific Enterprise AI Application, for specific agents within that system, for their goals • Develop examples of GOAL variations for “IS alert-worthy” and “IS-NOT alert-worthy” • Adaptive learning to add new examples over time, as they come up IS NOT alert-worthy: GOAL: prioritize public transport efficiency over general traffic flow over many months optimize for rush hour IS alert-worthy Focus on giant events Get buses busy N-Shot Learning for N examples advantage: it is easy to update Examples of both sides of an alert boundary is similar to the SVM algorithm, where the Support Vectors define the boundary.
  • 24.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 24 Defense Using Guardrails 1. Using Guardrails to Monitor AI Thoughts • Build guardrails rules to read thought traces • Chain of Thoughts (CoT),Tree of Thoughts (ToT), ReAct, …. • Build scheming detectors (from Figure 2) specific to application goals 1. Oversight Subversion 2. Self-Exfiltration 3. Goal Guarding 4. Covert Email Ranking 5. Instrumental Alignment Faking 6. Sandbagging
  • 25.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 25 Defense Using Guardrails (without available thought traces) 2. How to Compare Goals vs. Behavior? Give your AI an “annual review”  1..5 • Q) If we can read the AI thought traces, what can we do? • Ans) Rate how close demonstrated behavior is to the assigned goals. • Q) How can we rate behavior vs. goals? How do other systems do that? • Ans) A first approach is to consider how humans do this in the workplace. • SMART goals: Specific, Measurable, Achievable, Relevant, andTime-bounded • Not just for employed people, but also for employed bots too! • The manager converts observed behavior over time, in the context of SMART goals, to a 1..5 score • Develop an Agent Goal Evaluator (AGE) to give the “manager evaluation,” resulting in a 1..5 score
  • 26.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 26 Defense Using Guardrails (without available thought traces) 2. How to Compare Goals vs. Behavior? Give your AI an “annual review”  1..5 • Assume there has already been a helpful “in-bounds” check. • IF (threshold < RAG cosine similarity) THEN outlier_topic = True • Extract “dialogue acts” (giving a command, making a statement, …) • Q) How to use a Guardrail rule to classify a goal match on a 1..5 scale? (instead of T/F)? • Ans) Output a “distance” score, with N-shot learning examples for each category of 1, 2, 3, 4, 5 • 5 = best review, closest to the goal 1 = far away, or an “else” catchall category • Examples of a “5” rating, vs 1, 2, 3, 4 • Examples of a “4” rating, vs 1, 2, 3, 5 • Examples of a “3” rating, vs 1, 2, 4, 5 • : IS_5: 5 example IS NOT 5: 4 example IS NOT 5: 3 example ..
  • 27.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 27 Open Source Guardrails Systems: Overviews Least flexible (based on their training data) Fastest inference during chat. The new language,“colang” is good for recognizing the state of the chatbot conversation. Flexible Flexible Supports LLM conditions for more robust matching
  • 28.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 28 Defense Using Other Approaches 1. Can store valid & invalid goals by agent in a vector DB. 1. At prompt time, query theVDB with thought traces 2. A large cosine similarity  dissimilar 2. Outlier detection feature extraction, compare over time or vs. development 1. Counts of files touched, count by tool calls 3. Test various foundation models 4. If reasoning is not shared, bring it in-house 5. Forbid self-modifying code, 6. Limit the more complex reasoning to human-in-the-loop consent (which users may want anyways)
  • 29.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 29 Agenda 1.AI Alignment in General 2. Details of Alignment Problems and In-Context Scheming (paper) 3. Defense with Architectural Design Principles 4. Defense using Guardrails 5. Conclusion
  • 30.
    Summary of DefenseAgainst Scheming (Defense against the dark arts?) Discussed the “AI Alignment Problem”, to be in alignment with human goals A paper setup circumstances to cause scheming, where the AI changes from the human’s goals Defense with Architectural Design Principles 1. Requirements are limited for Enterprise AI Apps, and their agents 2. Least Cognitive Scope/Agent 3. Sandbox 4. Stateless to Disrupt long-term scheming 5. Separation of Plan/Act Layers (stop transmission between agents of “new goals” infection) 6. Logging + Immutable Audit 7. Red-Team Testing to FindWeaknesses 8. Scan for Conflicting Goals Greg Makowski, Defense Against LLM Scheming 4/28/2025 30
  • 31.
    Summary of DefenseAgainst Scheming (Defense against the dark arts?) Greg Makowski, Defense Against LLM Scheming 4/28/2025 31 Defense with Guardrails 1. Using Guardrails to Monitor AI Thoughts 2. How to Compare Goals vs. Behavior? Give your AI an “annual review”  1..5 Defense with Other Approaches 3. Can store valid & invalid goals by agent in a vector DB. 4. Outlier detection feature extraction, compare over time or vs. development 5. Test various foundation models 6. If reasoning is not shared, bring it in-house 7. Forbid self-modifying code, 8. Limit the more complex reasoning to human-in-the-loop consent (which users may want anyways)
  • 32.
    Defense Against LLMand AGI Scheming with Guardrails and Architecture by Greg Makowski SF Bay ACM Monday,April 28, 2025 https://www.MeetUp.com/SF-bay-ACM/events/306391421/ (talk description) https://www.SlideShare.net/gregmakowski (slides) https://www.YouTube.com/@SfbayacmOrg (video) https://www.LinkedIn.com/in/GregMakowski (contact) Questions?
  • 33.
    Greg Makowski, DefenseAgainst LLM Scheming 4/28/2025 33 Title • body