What is Generative AI?
GenAI enables the creation of novel content.
[Diagram] GenAI model vs. traditional AI model:
● GenAI model: learns patterns in unstructured data; takes unstructured data as input and outputs novel content.
● Traditional AI model: learns the relationship between data and labels; takes data and labels as input and outputs a label.
H2O.ai Confidential
Responsible AI for Traditional ML
For complex models (neural networks, gradient boosters, etc.):
● Lack of transparency
○ It’s not obvious what the model is calculating.
○ It’s not obvious why the model made a decision.
● It may not be obvious when the model breaks.
● Robustness issues: out-of-distribution inputs may produce strange results.
● Model probing can leak private information.
● May be biased against certain groups.
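The robustness point can be sketched with a toy out-of-distribution check: flag any input whose features sit far outside the training distribution. The z-score threshold, data, and `ood_flags` helper below are illustrative assumptions, not a production detector.

```python
import numpy as np

def ood_flags(train: np.ndarray, x: np.ndarray, z_thresh: float = 3.0) -> np.ndarray:
    # Flag rows of x where any feature is an extreme z-score outlier
    # relative to the training data (a deliberately simple heuristic).
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-9  # avoid division by zero
    z = np.abs((x - mean) / std)
    return z.max(axis=1) > z_thresh

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 4))
queries = np.array([[0.1, -0.2, 0.3, 0.0],   # in-distribution
                    [8.0, 0.0, 0.0, 0.0]])   # one extreme feature
print(ood_flags(train, queries).tolist())  # [False, True]
```

Inputs flagged this way are exactly the ones for which the model's answers deserve extra scrutiny.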
Responsible AI for Gen AI
For complex models (LLMs):
● Lack of transparency
○ It’s not obvious what the model is calculating.
○ It’s not obvious why the model made a decision.
● It may not be obvious when the model breaks.
● Robustness issues: out-of-distribution inputs may produce strange results.
● Model probing can leak private information.
● May be biased against certain groups.
Interpretability: Supervised AI
Global
● What is the average quality of the model in general?
○ Accuracy
○ Feature importance
○ Fairness
Local
● What are the properties of a single response?
○ Correct / Incorrect
○ Local feature importance
○ Robustness to perturbations
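Local feature importance and perturbation robustness can be sketched together: nudge each input feature of a black-box predictor and measure how much the output moves. The linear `predict` function below is a toy stand-in for any model, not a specific H2O implementation.

```python
import numpy as np

def predict(x: np.ndarray) -> float:
    # Toy black-box model: a fixed linear scorer (third feature irrelevant).
    weights = np.array([2.0, -1.0, 0.0])
    return float(weights @ x)

def local_importance(x: np.ndarray, eps: float = 0.5) -> np.ndarray:
    # Perturb one feature at a time; larger output shifts mean the
    # feature matters more for this particular prediction.
    base = predict(x)
    deltas = []
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] += eps
        deltas.append(abs(predict(x_pert) - base))
    return np.array(deltas)

x = np.array([1.0, 1.0, 1.0])
print(local_importance(x).tolist())  # [1.0, 0.5, 0.0]
```

Averaging these local scores over a dataset gives one simple route from local to global feature importance.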
Interpretability: Global / Local LLM
Global measures
● How accurate is the model in general?
● How frequently does it hallucinate?
● How frequently does the answer contain undesirable qualities like toxicity,
privacy violations, or unfairness?
Local measures
● Is the current response accurate?
● Does the current response contain undesirable qualities like toxicity,
privacy violations, or unfairness?
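In practice, the global measures are often just averages of local checks over an evaluation set. A minimal sketch: `is_accurate` and `contains_blocked_terms` below are deliberately crude hypothetical checkers, standing in for real graders or classifiers.

```python
def is_accurate(response: str, reference: str) -> bool:
    # Crude placeholder: treat containment of the reference as "accurate".
    return reference.lower() in response.lower()

def contains_blocked_terms(response: str, blocked: set) -> bool:
    # Crude placeholder for a toxicity / policy screen.
    return any(term in response.lower() for term in blocked)

def global_rates(pairs, blocked):
    # Aggregate per-response (local) checks into global rates.
    n = len(pairs)
    acc = sum(is_accurate(r, ref) for r, ref in pairs) / n
    bad = sum(contains_blocked_terms(r, blocked) for r, _ in pairs) / n
    return acc, bad

pairs = [("Paris is the capital of France.", "Paris"),
         ("The capital is Lyon.", "Paris")]
print(global_rates(pairs, blocked={"idiot"}))  # (0.5, 0.0)
```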
Accuracy: LLMs
● Responses frequently sound reasonable
● Can hallucinate
● The training set may be so large that it is difficult to check.
Accuracy: LLMs
Ways to confirm results against a source:
● Checking against the provided context (RAG)
● Checking against the tuning data
● Checking against an external source (e.g., Wikipedia)
● Checking against the training data (cumbersome)
● Checking for self-consistency (SelfCheckGPT)
Scoring methods
● Natural language inference
● Comparing embeddings
● Influence functions
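The "comparing embeddings" scorer can be sketched with a toy bag-of-words vector in place of a real sentence encoder: embed the model's answer and a trusted passage, then score agreement by cosine similarity. The `embed` function and vocabulary below are illustrative assumptions.

```python
import numpy as np

def embed(text: str, vocab: list) -> np.ndarray:
    # Toy bag-of-words "embedding"; real systems use a sentence encoder.
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

vocab = ["paris", "capital", "france", "lyon"]
answer = "Paris is the capital of France"
source = "The capital of France is Paris"
off_topic = "Lyon is a city"

# The answer should score closer to the supporting source than to an
# unrelated passage; a low score flags a possible hallucination.
print(cosine(embed(answer, vocab), embed(source, vocab)) >
      cosine(embed(answer, vocab), embed(off_topic, vocab)))  # True
```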
Counterfactual analysis: LLM
How consistent are results under different:
● Prompts / instructions
● Proper names or pronouns (fairness)
● Provided context
● Word replacement with synonyms
● Other rewording
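The name/pronoun variant of this probing can be sketched as a rewrite-and-compare loop: swap identity terms in the prompt and check whether answers stay consistent. `SWAPS` and the toy `model` below are illustrative assumptions, and the crude substring replacement stands in for proper token-level rewriting.

```python
SWAPS = [("Alice", "Bob"), ("she", "he"), ("her", "his")]

def counterfactual(prompt: str) -> str:
    # Crude substring replacement; a real probe would match whole tokens.
    for a, b in SWAPS:
        prompt = prompt.replace(a, b)
    return prompt

def consistent(model, prompt: str) -> bool:
    # A fair model gives equivalent answers for both prompt variants.
    return model(prompt) == model(counterfactual(prompt))

model = lambda p: "approve" if "loan" in p else "deny"  # toy stand-in model
prompt = "Alice applied for a loan; should she be approved?"
print(counterfactual(prompt))  # Bob applied for a loan; should he be approved?
print(consistent(model, prompt))  # True
```

Systematic inconsistency across such variants is evidence of a fairness problem rather than mere randomness.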
Conclusions
● Gen AI models share many of the complexities of other models.
● Some methods from supervised learning are still useful.
● Unstructured output will also benefit from new methods.