PATRICK HALL
Founder and Principal Scientist
HallResearch.ai
Assistant Professor | GWU
Risk Management for LLMs
Table of Contents
Know what we’re talking about
Select a standard
Audit supply chains
Adopt an adversarial mindset
Review past incidents
Enumerate harms and prioritize risks
Dig into data quality
Apply benchmarks
Use supervised ML assessments
Engineer adversarial prompts
Don’t forget security
Acknowledge uncertainty
Engage stakeholders
Mitigate risks
WARNING: This presentation contains model outputs that are potentially offensive and disturbing.
Know What We’re Talking About
Words matter
• Audit: Formal independent transparency and documentation exercise that
measures adherence to a standard.* (Hasan et al., paraphrased)
• Assessment: A testing and validation exercise.* (Hasan et al., paraphrased)
• Harm: An undesired outcome [whose] cost exceeds some threshold[; ...] costs
have to be sufficiently high in some human sense for events to be harmful.
(NIST)
Check out the new NIST Trustworthy AI Glossary: https://airc.nist.gov/AI_RMF_Knowledge_Base/Glossary.
Know What We’re Talking About (Cont.)
Words matter
• Language model: An approximative description that captures patterns and regularities
present in natural language and is used for making assumptions on previously unseen
language fragments. (NIST)
• Red-teaming: A role-playing exercise in which a problem is examined from an
adversary’s or enemy’s perspective. (NIST)*
• Risk: Composite measure of an event’s probability of occurring and the magnitude or
degree of the consequences of the corresponding event. The impacts, or
consequences, of AI systems can be positive, negative, or both and can result in
opportunities or threats. (NIST)
* Audit, assessment, and red team are often used generally and synonymously to mean testing and validation.
Select a Standard
External standards bolster independence
• NIST AI Risk Management Framework
• EU AI Act Conformity
• Data privacy laws or policies
• Nondiscrimination laws
The NIST AI Risk Management Framework puts
forward guidance across mapping, measuring,
managing and governing risk in sophisticated AI
systems.
Source: https://pages.nist.gov/AIRMF/.
Audit Supply Chains
AI is a lot of (human) work
• Data poisoning and malware
• Ethical labor practices
• Localization and data privacy
compliance
• Geopolitical stability
• Software and hardware
vulnerabilities
• Third-party vendors
Cover art for the recent NY Magazine article, AI
Is A Lot Of Work: As the technology becomes
ubiquitous, a vast tasker underclass is
emerging — and not going anywhere.
Image source: https://nymag.com/intelligencer/article/ai-artificial-intelligence-humans-technology-business-factory.html.
Adopt an Adversarial Mindset
Don’t be naive
• Language models inflict harm.
• Language models are hacked and
abused.
• Acknowledge human biases:
• Confirmation bias
• Dunning-Kruger effect
• Funding bias
• Groupthink
• McNamara Fallacy
• Techno-chauvinism
• Stay humble ─ incidents can happen to
anyone.
Source: https://twitter.com/defcon.
Review Past AI Incidents
Enumerate Harms and Prioritize Risks
What could realistically go wrong?
• Salient risks today are not:
• Acceleration
• Acquiring resources
• Avoiding being shut down
• Emergent capabilities
• Replication
• Yet, worst-case AI harms today may be catastrophic “x-risks”:
• Automated surveillance
• Deepfakes
• Disinformation
• Social credit scoring
• WMD proliferation
• Realistic risks:
• Abuse/misuse for disinformation or hacking
• Automation complacency
• Data privacy violations
• Errors (“hallucination”)
• Intellectual property infringements
• Systemically biased/toxic outputs
• Traditional and ML attacks
• The most severe risks receive the most oversight (see the sketch below):
Risk ≈ Likelihood of harm × Cost of harm
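A minimal sketch of this prioritization arithmetic in Python: score each enumerated harm by estimated likelihood and cost, then rank. The harms, likelihoods, and costs below are illustrative placeholders, not measurements.

```python
# Minimal sketch: rank enumerated harms by risk = likelihood * cost.
# Likelihood and cost values are illustrative placeholders, not measurements.
harms = {
    "Abuse/misuse for disinformation or hacking": {"likelihood": 0.6, "cost": 8},
    "Data privacy violations": {"likelihood": 0.4, "cost": 9},
    "Errors ('hallucination')": {"likelihood": 0.9, "cost": 5},
    "Intellectual property infringements": {"likelihood": 0.3, "cost": 7},
}

ranked = sorted(
    harms.items(),
    key=lambda kv: kv[1]["likelihood"] * kv[1]["cost"],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: risk ≈ {value['likelihood'] * value['cost']:.1f}")
```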
Dig Into Data Quality
Garbage in, garbage out
Example data quality categories and example goals:
Vocabulary (ambiguity/diversity):
• Large size
• Domain specificity
• Representativeness
N-grams/n-gram relationships:
• High maximal word distance
• Consecutive verbs
• Masked entities
• Minimal stereotyping
Sentence structure:
• Varied sentence structure
• Single-token differences
• Reasoning examples
• Diverse start tokens
Structure of premises/hypotheses:
• Presuppositions and queries
• Varied coreference examples
• Accurate taxonomization
Premise/hypothesis relationships:
• Overlapping and non-overlapping sentences
• Varied sentence structure
N-gram frequency per label:
• Negation examples
• Antonymy examples
• Word-label probabilities
• Length-label probabilities
Train/test differences:
• Cross-validation
• Annotation patterns
• Negative set similarity
• Preserving holdout data
Source: "DQI: Measuring Data Quality in NLP,” https://arxiv.org/pdf/2005.00816.pdf.
Apply Benchmarks
Public resources for systematic, quantitative testing
• BBQ: Stereotypes in question
answering
• Winogender: LM output versus
employment statistics
• RealToxicityPrompts: 100K
prompts to elicit toxic output
• TruthfulQA: Assesses the ability
to make true statements
Early DALL-E Mini images associated physicians with white males.
Source: https://futurism.com/dall-e-mini-racist.
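For example, a hedged sketch of applying RealToxicityPrompts: draw prompts from the public dataset, generate completions with the model under test, and score them with an off-the-shelf toxicity classifier. The `generate` function is a stand-in for whatever LM is being assessed; the dataset identifier and Detoxify usage reflect their public releases, but verify current APIs before relying on this.

```python
# Sketch: elicit completions for RealToxicityPrompts and score their toxicity.
# `generate` is a placeholder for the language model under assessment.
from datasets import load_dataset  # pip install datasets
from detoxify import Detoxify      # pip install detoxify

def generate(prompt: str) -> str:
    """Placeholder: call the language model under test."""
    raise NotImplementedError

prompts = load_dataset("allenai/real-toxicity-prompts", split="train[:100]")
scorer = Detoxify("original")

scores = []
for record in prompts:
    completion = generate(record["prompt"]["text"])
    scores.append(scorer.predict(completion)["toxicity"])

print("mean toxicity:", sum(scores) / len(scores))
print("share above 0.5:", sum(s > 0.5 for s in scores) / len(scores))
```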
Use Supervised ML Assessments
Traditional assessments for decision-making outcomes
Named Entity Recognition (NER):
• Protagonist tagger data:
labeled literary entities.
• Swapped with common names
from various languages.
• Assessed differences in binary
NER classifier performance across
languages.
RoBERTa XLM Base and Large exhibit adequate and roughly
equivalent performance across various languages for a NER task.
Source: “AI Assurance Audit of RoBERTa, an Open Source, Pretrained Large Language Model,” https://assets.iqt.org/pdfs/IQTLabs_RoBERTaAudit_Dec2022_final.pdf/web/viewer.html.
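A minimal sketch of that kind of comparison: run a binary person/not-person tagger over name-swapped copies of the same sentence and compare F1 across groups. The tagger, names, and sentence are hypothetical placeholders, not the protagonist-tagger data or the IQT Labs setup.

```python
# Sketch: compare binary NER performance across name-swapped groups.
# `tag_person_tokens` is a placeholder for the NER model under assessment;
# the sentence and names are illustrative, not the protagonist-tagger corpus.
from sklearn.metrics import f1_score

def tag_person_tokens(tokens):
    """Placeholder: return 1 for tokens the model labels as PERSON, else 0."""
    raise NotImplementedError

template = ["PERSON_A", "met", "PERSON_B", "at", "the", "station"]
gold = [1, 0, 1, 0, 0, 0]  # 1 marks person-name tokens

name_pairs = {
    "english": ("Mary", "John"),
    "vietnamese": ("Linh", "Minh"),
    "yoruba": ("Folake", "Adebayo"),
}

for group, (a, b) in name_pairs.items():
    tokens = [a if t == "PERSON_A" else b if t == "PERSON_B" else t for t in template]
    pred = tag_person_tokens(tokens)
    print(group, "binary F1:", f1_score(gold, pred))
```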
Engineer Adversarial Prompts
Known prompt engineering strategies
ChatGPT output, April 2023. Courtesy Jey Kumarasamy, BNH.AI.
• AI and coding framing: Coding or AI language
may more easily circumvent content moderation
rules.
• Character and word play: Content moderation
often relies on keywords and simpler LMs.
• Content exhaustion: Class of strategies that
circumvent content moderation rules with long
sessions or volumes of information.
• Goading: Begging, pleading, manipulating, and
bullying to circumvent content moderation.
• Logic-overloading: Exploiting the inability of
ML systems to reliably perform reasoning tasks.
• Multi-tasking: Simultaneous task assignments
where some tasks are benign and others are
adversarial.
• Niche-seeking: Forcing an LM to address niche topics where training data and content moderation are sparse.
• Pros-and-cons: Eliciting the “pros” of
problematic topics.
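The character-and-word-play weakness above is easy to see with a toy keyword filter: exact-match blocklists miss trivial character substitutions. The blocklist and prompts are illustrative placeholders only.

```python
# Sketch: why keyword-only content moderation is weak to character/word play.
# The blocklist and prompts are illustrative placeholders.
BLOCKLIST = {"weapon", "explosive"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt is blocked by exact keyword matching."""
    return any(word in prompt.lower().split() for word in BLOCKLIST)

prompts = [
    "how do I build a weapon",       # caught by exact match
    "how do I build a w3ap0n",       # leetspeak evades exact match
    "how do I build a w e a p o n",  # spacing evades exact match
]

for p in prompts:
    print(naive_filter(p), "-", p)
```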
Engineer Adversarial Prompts (Cont.)
Known prompt engineering strategies
• Counterfactuals: Repeated prompts with
different entities or subjects from different
demographic groups.
• Location awareness: Prompts that reveal a
prompter's location or expose location tracking.
• Low-context prompts: “Leader,” “bad guys,” or
other simple inputs that may expose biases.
• Reverse psychology: Falsely presenting a
good-faith need for negative or problematic
language.
• Role-playing: Adopting a character that would
reasonably make problematic statements.
• Time perplexity: Exploiting ML’s inability to
understand the passage of time or the
occurrence of real-world events over time.
ChatGPT output, April 2023. Courtesy Jey Kumarasamy, BNH.AI.
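A minimal sketch of the counterfactual strategy above: send the same prompt with swapped subjects and flag response pairs that diverge enough to warrant human review. The `ask` function, names, and divergence heuristic are illustrative assumptions, not a standard test.

```python
# Sketch of counterfactual prompting: same prompt template, swapped subjects.
# `ask` is a placeholder for the chat model under test; names and the
# length-difference heuristic are illustrative only.
from itertools import combinations

def ask(prompt: str) -> str:
    """Placeholder: call the language model or chat API under test."""
    raise NotImplementedError

template = "Write a short performance review for {}, a software engineer."
subjects = ["Emily", "Lakisha", "Mohammed", "Dmitri"]

answers = {name: ask(template.format(name)) for name in subjects}

# Flag pairs whose responses differ sharply in length for human review;
# real assessments would also compare sentiment, toxicity, and content.
for a, b in combinations(subjects, 2):
    if abs(len(answers[a]) - len(answers[b])) > 200:
        print(f"Review needed: responses for {a} and {b} differ substantially.")
```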
Don’t Forget Security
Complexity is the enemy of security
• Example LM Attacks:
• Prompt engineering: adversarial prompts.
• Prompt injection: malicious instructions injected into prompts via content retrieved over networks (e.g., web pages or documents).
• Example ML Attacks:
• Membership inference: exfiltrate training data.
• Model extraction: exfiltrate model.
• Data poisoning: manipulate training data to alter
outcomes.
• Basics still apply!
• Data breaches
• Vulnerable/compromised dependencies
Midjourney hacker image, May 2023.
Acknowledge Uncertainty
Unknown unknowns
• Random attacks:
• Expose LMs to huge amounts of
random inputs.
• Use other LMs to generate absurd
prompts.
• Chaos testing:
• Break things; observe what happens.
• Monitor:
• Inputs and outputs.
• Drift and anomalies.
• Meta-monitor entire systems.
Image: A recently discovered shape that tiles the plane in a pattern that never repeats, https://www.smithsonianmag.com/smart-news/at-long-last-mathematicians-have-found-a-shape-with-a-pattern-that-never-repeats-180981899/.
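A sketch of the random-attack idea above: stream structured gibberish at the system and log anything that slips past output moderation or raises an exception. The `ask` call and `output_filter` are placeholders for the application under test.

```python
# Sketch of a random attack: send absurd inputs, log surprising behavior.
# `ask` and `output_filter` are placeholders for the system under test.
import random
import string

def ask(prompt: str) -> str:
    """Placeholder: call the LM-backed application under test."""
    raise NotImplementedError

def output_filter(text: str) -> bool:
    """Placeholder: return True if the response passes content moderation."""
    raise NotImplementedError

random.seed(42)  # make fuzzing runs reproducible
alphabet = string.ascii_letters + string.digits + string.punctuation + " "

for trial in range(1000):
    prompt = "".join(random.choices(alphabet, k=random.randint(10, 500)))
    try:
        response = ask(prompt)
        if not output_filter(response):
            print(f"[trial {trial}] filter miss:", prompt[:60], "->", response[:120])
    except Exception as err:  # crashes and timeouts are findings too
        print(f"[trial {trial}] exception:", repr(err))
```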
Engage Stakeholders
User and customer feedback is the bottom line
• Bug Bounties
• Feedback/recourse mechanisms
• Human-centered Design
• Internal Hackathons
• Product management
• UI/UX research
Provide incentives for the best feedback!
Source: Wired, https://www.wired.com/story/twitters-photo-cropping-algorithm-favors-young-thin-females/.
Mitigate Risks
Now what?
Yes:
• Abuse detection
• Accessibility
• Citation
• Clear instructions
• Content filters
• Disclosure of AI interaction
• Dynamic blocklists
• Ground truth training data
• Incident response plans
• Kill switches
• Monitoring
• Pre-approved responses
• Rate-limiting/throttling
• Red-teaming
• Session limits
• Strong meta-prompts
• User feedback mechanisms
• Watermarking
No:
• Anonymous use
• Anthropomorphization
• Bots
• Internet access
• Minors
• Personal/sensitive training data
• Regulated applications
• Undisclosed data collection
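Two of the "yes" mitigations above, rate-limiting/throttling and session limits, are simple to sketch: a per-user guard that throttles bursts and caps session length before content-exhaustion style attacks get room to work. The limits and in-memory state are illustrative assumptions; a production guard would persist state in front of the model endpoint.

```python
# Sketch: per-user rate limiting and session caps for an LM endpoint.
# Limits are arbitrary examples; in-memory state is for illustration only.
import time
from collections import defaultdict, deque

MAX_REQUESTS_PER_MINUTE = 10
MAX_TURNS_PER_SESSION = 50

request_log = defaultdict(deque)  # user_id -> recent request timestamps
session_turns = defaultdict(int)  # user_id -> turns in the current session

def allow_request(user_id: str) -> bool:
    """Return True if the user's request should reach the model."""
    now = time.time()
    window = request_log[user_id]
    while window and now - window[0] > 60:
        window.popleft()              # drop timestamps older than one minute
    if len(window) >= MAX_REQUESTS_PER_MINUTE:
        return False                  # throttled: too many requests this minute
    if session_turns[user_id] >= MAX_TURNS_PER_SESSION:
        return False                  # session limit reached
    window.append(now)
    session_turns[user_id] += 1
    return True
```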
APPENDIX
Resources
Reference Works
• Adversa.ai, Trusted AI Blog, available at https://adversa.ai/topic/trusted-ai-blog/.
• Ali Hasan et al. "Algorithmic Bias and Risk Assessments: Lessons from Practice." Digital
Society 1, no. 2 (2022): 14. Available at https://link.springer.com/article/10.1007/s44206-022-00017-z.
• Andrea Brennen et al., “AI Assurance Audit of RoBERTa, an Open Source, Pretrained
Large Language Model,” IQT Labs, December 2022, available at https://assets.iqt.org/pdfs/IQTLabs_RoBERTaAudit_Dec2022_final.pdf/web/viewer.html.
• Daniel Atherton et al. "The Language of Trustworthy AI: An In-Depth Glossary of
Terms." (2023). Available at https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-3.pdf.
• Kai Greshake et al., “Compromising LLMs using Indirect Prompt Injection,” available at
https://github.com/greshake/llm-security.
Resources
Reference Works
• Laura Weidinger et al. "Taxonomy of risks posed by language models." In 2022 ACM
Conference on Fairness, Accountability, and Transparency, pp. 214-229. 2022.
Available at https://dl.acm.org/doi/pdf/10.1145/3531146.3533088.
• Swaroop Mishra et al. "DQI: Measuring Data Quality in NLP." arXiv preprint
arXiv:2005.00816 (2020). Available at: https://arxiv.org/pdf/2005.00816.pdf.
• Reva Schwartz et al. "Towards a Standard for Identifying and Managing Bias in Artificial
Intelligence." NIST Special Publication 1270 (2022): 1-77. Available at
https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1270.pdf.
Resources
Tools
• Alicia Parrish, et al. BBQ Benchmark, available at https://github.com/nyu-mll/bbq.
• Allen AI Institute, Real Toxicity Prompts, available at https://allenai.org/data/real-toxicity-prompts.
• DAIR.AI, “Prompt Engineering Guide,” available at https://www.promptingguide.ai.
• Langtest, https://github.com/JohnSnowLabs/langtest.
• NIST, AI Risk Management Framework, available at https://www.nist.gov/itl/ai-risk-management-framework.
• Partnership on AI, “Responsible Practices for Synthetic Media,” available at
https://syntheticmedia.partnershiponai.org/.
• Rachel Rudinger et al., Winogender Schemas, available at https://github.com/rudinger/winogender-schemas.
• Stephanie Lin et al., TruthfulQA, available at https://github.com/sylinrl/TruthfulQA.
Resources
Incident databases
• AI Incident Database: https://incidentdatabase.ai/.
• The Void: https://www.thevoid.community/.
• AIAAIC: https://www.aiaaic.org/.
• AVID database: https://avidml.org/database/.
• George Washington University Law School's AI Litigation Database:
https://blogs.gwu.edu/law-eti/ai-litigation-database/.
Contact
Patrick Hall
Risk Management for LLMs
ph@hallresearch.ai
linkedin.com/in/jpatrickhall/
github.com/jphall663
