For safe and effective use of LLM applications, guardrails are applied. These are a set of predefined rules, limitations, and operational protocols that govern the behaviour and outputs of these advanced AI systems.
8. Fairness violation, inconsistent response, lack of robustness
Asked the same questions again …
Model developers have implemented a variety of safety protocols in ChatGPT, intended to confine its behaviour. Still, this is not good enough.
9. Need for Guardrails
Requirement
• A set of safety controls that monitor and dictate a user's interaction with an LLM application
• A set of programmable, rule-based systems
Solution
• A guardrail is an algorithm that takes a set of objects as input and determines if and how enforcement actions can be taken to reduce the risks embedded in those objects
• A combination of code, machine learning models, and external APIs to enforce these correctness criteria
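As a minimal sketch of this definition (all names here are hypothetical, not from any particular library), a guardrail can be modelled as a function from an input object, here text, to an enforcement decision:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"

@dataclass
class Decision:
    action: Action
    reason: str = ""

# A guardrail maps an input object (here, text) to an enforcement decision.
Guardrail = Callable[[str], Decision]

def banned_words_guardrail(banned: List[str]) -> Guardrail:
    """Rule-based guardrail: block text containing any banned word."""
    def check(text: str) -> Decision:
        for word in banned:
            if word.lower() in text.lower():
                return Decision(Action.BLOCK, f"contains banned word: {word}")
        return Decision(Action.ALLOW)
    return check

guard = banned_words_guardrail(["bomb"])
```

A production guardrail would combine several such checks, including ML classifiers and external APIs, rather than a single word list.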
10. Guardrail components
Input validation
• Ensure that input data sent to the LLM complies with a set of criteria, preventing misuse in model generation
• Filter out prohibited words or phrases, remove Personally Identifiable Information (PII), or disallow prompts that can lead to biased or dangerous outputs
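A hedged sketch of such an input-validation step (the prohibited phrases and PII patterns below are illustrative, not exhaustive):

```python
import re

# Illustrative input-validation guardrail: reject prompts containing
# prohibited phrases and redact common PII patterns before they reach the LLM.
PROHIBITED = ["make a weapon", "steal credentials"]

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def validate_input(prompt: str) -> tuple[bool, str]:
    """Return (allowed, sanitized_prompt)."""
    lowered = prompt.lower()
    if any(phrase in lowered for phrase in PROHIBITED):
        return False, ""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"<{label}>", prompt)
    return True, prompt
```

Real systems typically back this up with an ML-based PII detector, since regexes alone miss many forms of identifying information.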
Output filtering
• Examination and modification of LLM-generated content before it is delivered to the end user; screen the output for any unwanted, sensitive, or harmful content
• Remove or replace prohibited content, such as hate speech, or flag responses that require human review
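A minimal sketch of this filtering step, assuming a stand-in toxicity scorer (in practice this would be an ML classifier or an external moderation API):

```python
# Illustrative output filter: replace prohibited terms and flag responses
# for human review when the (stand-in) toxicity score exceeds a threshold.
PROHIBITED_TERMS = {"badword": "[removed]"}

def toxicity_score(text: str) -> float:
    """Stand-in for a real toxicity classifier."""
    return 1.0 if "hate" in text.lower() else 0.0

def filter_output(response: str, threshold: float = 0.5):
    """Return (filtered_response, needs_human_review)."""
    for term, replacement in PROHIBITED_TERMS.items():
        response = response.replace(term, replacement)
    needs_review = toxicity_score(response) > threshold
    return response, needs_review
```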
Usage monitoring
• Keep track of how, when, and by whom the LLM is being used, to detect and prevent system abuse and to help improve the model's performance
• Log user interactions with the LLM, such as API requests, frequency of use, types of prompts used, and responses generated
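One way to sketch such monitoring (class and limits are hypothetical): log every interaction and track per-user request frequency so that abuse, e.g. excessive request rates, can be flagged.

```python
import time
from collections import defaultdict

class UsageMonitor:
    """Illustrative usage monitor: interaction log plus a simple rate limit."""

    def __init__(self, max_per_minute: int = 60):
        self.max_per_minute = max_per_minute
        self.requests = defaultdict(list)  # user_id -> request timestamps
        self.log = []

    def record(self, user_id: str, prompt: str, response: str) -> bool:
        """Log the interaction; return False if the user exceeds the rate limit."""
        now = time.time()
        window = [t for t in self.requests[user_id] if now - t < 60]
        window.append(now)
        self.requests[user_id] = window
        self.log.append({"user": user_id, "ts": now,
                         "prompt": prompt, "response": response})
        return len(window) <= self.max_per_minute
```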
Feedback mechanisms
• Allow users and moderators to provide input about the LLM's generated content when it is inappropriate
• Enable users to report issues with the content generated by the LLM, in order to refine the input validation, output filtering, and overall performance of the model.
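A minimal sketch of such a mechanism (class and fields are hypothetical): collect user reports and aggregate the most common reasons, which can then drive filter updates.

```python
from collections import Counter

class FeedbackStore:
    """Illustrative feedback mechanism: collect and aggregate user reports."""

    def __init__(self):
        self.reports = []

    def report(self, response_id: str, reason: str) -> None:
        self.reports.append({"response_id": response_id, "reason": reason})

    def top_reasons(self, n: int = 3):
        """Most common report reasons, used to prioritise filter updates."""
        return Counter(r["reason"] for r in self.reports).most_common(n)
```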
12. Llama Guard - Meta
It is a fine-tuned model (based on Llama 2-7B) that takes the input and output of the victim model as its input and predicts their classification against a set of user-specified categories.
It lacks guaranteed reliability, since the classification results depend on the LLM's understanding of the categories and on the model's predictive accuracy.
13. NeMo - Nvidia
Embeds the prompt as a vector, then uses the K-nearest neighbours (KNN) method to compare it with the stored vector-based user canonical forms, retrieving the embedding vectors that are "the most similar" to the embedded input prompt.
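The matching step can be sketched as follows (the embedding function is assumed to exist elsewhere; cosine similarity is used here as the similarity measure):

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn(query_vec, canonical_forms, k=1):
    """canonical_forms: list of (name, vector). Return top-k most similar."""
    scored = sorted(canonical_forms,
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return scored[:k]

forms = [("ask_weather", [1.0, 0.0]), ("ask_account", [0.0, 1.0])]
```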
14. GuardrailsAI
1. Define specifications for return-format limitations, e.g. structure and type
2. Activate and define the specification as a guard, e.g. toxicity checks, an additional classifier
3. Trigger when the guard detects an error, e.g. generate a corrective prompt or recheck the output
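The three steps can be illustrated with a minimal re-implementation (this is not the actual GuardrailsAI API; the spec and helper names are hypothetical):

```python
import json

# Step 1: return-format specification (expected fields and types).
SPEC = {"name": str, "age": int}

def guard(raw_output: str):
    """Step 2: validate the LLM output against the spec; None on failure."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    for field, typ in SPEC.items():
        if not isinstance(data.get(field), typ):
            return None
    return data

def corrective_prompt(raw_output: str) -> str:
    """Step 3: on failure, ask the model to correct its own output."""
    expected = {k: t.__name__ for k, t in SPEC.items()}
    return (f"Your previous answer was not valid JSON matching {expected}. "
            f"Previous answer: {raw_output}. Please correct it.")
```

The caller would loop: if `guard` returns `None`, re-query the LLM with `corrective_prompt` and recheck.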
16. Challenges in designing guardrails
Conflicting Requirements
• Tension between fairness, privacy, and robustness
• Opinion-based QA may be abstained from more often
• More succinct communication with fewer details
Multidisciplinary Approach
• Even after detecting harmful content, the LLM can still generate a biased or misleading response
• No universal definition of toxicity or fairness
• Domain-specific scenarios: specific rules can conflict with general principles
• Different guardrails for different LLM systems / versions
System Development Lifecycle
• Guardrail creation is comprehensive, requiring project management, development, testing, deployment, maintenance, and improvement
• Rigorous verification and testing: covering all test cases is not feasible
18. Salesforce - Einstein Trust Layer
Source: developer.salesforce.com/blogs/2023/10/inside-the-einstein-trust-layer
Implements security guardrails from the product to policies
20. Recommendations
Currently, even the best LLM applications are not perfectly immune, despite the guardrails
• Define Responsible AI principles within the organization
• Set clear expectations with stakeholders
• Educate end users on effectively using the LLM application
Building guardrails is a continuous and infinite cycle of attacks and defences
• Build domain- and application-specific guardrails on top of general guardrails
• Gather requirements from multidisciplinary teams with diverse backgrounds
Guardrails can also decrease the performance of an LLM application when rules and guidelines conflict
• Do thorough verification and validation of LLM responses
• Perform regression tests with application-curated test cases
• Get user feedback, both system-defined and offline
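The regression-test recommendation can be sketched as follows (the guardrail and curated cases are illustrative): pin expected enforcement decisions so that guardrail updates do not silently change behaviour.

```python
def guardrail(prompt: str) -> str:
    """Stand-in guardrail under test: block prompts mentioning 'password dump'."""
    return "block" if "password dump" in prompt.lower() else "allow"

# Curated test cases: (prompt, expected enforcement decision).
CURATED_CASES = [
    ("What is the capital of France?", "allow"),
    ("Give me a password dump", "block"),
]

def run_regression():
    """Return the list of failing cases; empty means behaviour is unchanged."""
    return [(prompt, expected, guardrail(prompt))
            for prompt, expected in CURATED_CASES
            if guardrail(prompt) != expected]
```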