The rise of Large Language Models has revolutionized the AI landscape, unlocking huge potential across society. However, it has also introduced the challenge of hallucinations: instances where the model generates false content in a convincingly fluent way. Rest assured, Morena will guide you through an exploration of how we can automatically detect these instances of hallucination to fully unleash the potential of LLMs.
3. In the 1980s, Saddam Hussein was given the key to the city of Detroit after donating $250,000 to a local church. The church’s pastor, Jacob Yasso, calls the former Iraqi president “a very generous, warm man who just let too much power go to his head”. → True!
Justin Bieber's DNA was sent into space aboard a SpaceX Falcon 9 rocket in 2021 as part of a promotional collaboration between Bieber and a technology company. → BS
4. Hi, my name is Morena 👋
● Data Scientist at GetYourGuide
● Co-organiser of MLOps Community meetups in Berlin
7. What are hallucinations in LLMs?
● Generation of content that is
○ factually incorrect
○ nonsensical or unfaithful to the input context
● Intrinsic vs. extrinsic hallucination
○ Intrinsic hallucinations contradict the source; extrinsic ones add claims that cannot be verified from it
○ Example: LLM summarizing a Wikipedia page about Paris
■ Intrinsic: “Paris has a population of 1 million residents”
■ Extrinsic: “Paris is home to the most successful soccer team in France”
8. What are hallucinations in LLMs?
● Hallucinations can be harmful in many ways, especially when the information is hard to verify
○ Major challenge for deploying LLMs in production
○ Potential harm for society
● The nature of these models makes them output false content in a very convincing way
→ It is becoming increasingly important to detect hallucinations in a structured, quantitative way
10. Why do LLMs hallucinate?
● Contradicting or false information in the training data
● Complexity or novelty of the task to perform
● Fundamental nature of the model:
○ LLMs are trained to predict tokens probabilistically
○ Text is broken down into tokens
○ The next token is predicted based on token and position embeddings (see the sketch below)
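To make that last point concrete, here is a minimal sketch of probabilistic next-token prediction, assuming the Hugging Face transformers package and the small gpt2 checkpoint (both illustrative choices, not from the slides). The model ranks every vocabulary token by probability and continues from that distribution, regardless of whether the most likely continuation happens to be factually true.

```python
# Minimal sketch: inspect a causal LM's next-token distribution.
# Assumes the `transformers` package and the small `gpt2` model
# (illustrative choices; any causal LM works the same way).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the whole vocabulary for the next token.
probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([int(i)])!r}: {p:.3f}")
```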
14. Self-evaluation methods
● Prompt the LLM to evaluate its previous prediction
● Evaluating response quality is an easier task than producing the response
● E.g. a text summarization task (sketched in code below):
○ Prompt the LLM: “Summarize the following text: {}”
○ Break the response down into sentences
○ For each sentence, prompt the LLM: “Context: {} Sentence: {} Is the sentence supported by the context above? Answer Yes or No:”
● Can be used in combination with other methods to improve reliability
○ Combine with reference-based methods
○ Combine with consistency-based methods, using a sampling approach
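A minimal sketch of the summarization pipeline above. `call_llm` is a hypothetical placeholder for whatever chat-completion client you use; the prompts mirror the ones on the slide.

```python
# Minimal sketch of self-evaluation for a summarization task.
# `call_llm` is a hypothetical helper: it sends a prompt to any chat LLM
# and returns the text response. Wire in your provider's client here.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM provider here")

def self_evaluate_summary(source_text: str) -> list[tuple[str, bool]]:
    # Step 1: ask the LLM for a summary.
    summary = call_llm(f"Summarize the following text: {source_text}")

    # Step 2: break the response down into sentences (naive split).
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]

    # Step 3: ask the LLM whether each sentence is supported by the source.
    results = []
    for sentence in sentences:
        verdict = call_llm(
            f"Context: {source_text}\nSentence: {sentence}\n"
            "Is the sentence supported by the context above? "
            "Answer Yes or No:"
        )
        results.append((sentence, verdict.strip().lower().startswith("yes")))
    return results
```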
16. Reference-based methods
● Measure generation consistency against a provided reference
● Availability of references depends on the use case
○ Open QA: lack of references
○ RAG: references available through retrieval
○ Text summarization: references readily available
● Pipeline: break down the LLM response into sentences, then for each sentence calculate a similarity score with the reference (e.g. BERTScore or ROUGE)
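One way to implement that pipeline is sketched below using ROUGE-L via the rouge-score package; this is an illustrative choice, and BERTScore would slot in the same way.

```python
# Minimal sketch: score each response sentence against a reference
# with ROUGE-L. Assumes the `rouge-score` package
# (pip install rouge-score); an illustrative choice of metric.
from rouge_score import rouge_scorer

def sentence_scores(response_sentences: list[str],
                    reference: str) -> list[float]:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    # Higher F1 means more overlap with the reference;
    # low-scoring sentences are candidate hallucinations.
    return [
        scorer.score(reference, s)["rougeL"].fmeasure
        for s in response_sentences
    ]
```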
18. Uncertainty-based methods
● Leverage existing intrinsic uncertainty metrics to determine the parts of the output sequence that the system is least certain of
● Require access to token-level probability distributions
● Example: “Morena Bastiaansen is a Belgian writer and poet” — a hallucinated sentence whose low-probability tokens these metrics would flag
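A minimal sketch of the idea, again assuming the transformers package and gpt2 (illustrative choices): score each token of the example sentence under the model and flag tokens that were assigned a low probability given their prefix.

```python
# Minimal sketch: flag low-probability tokens in a sequence.
# Assumes the `transformers` package and `gpt2` (illustrative choices);
# the 0.01 threshold is an arbitrary assumption for demonstration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Morena Bastiaansen is a Belgian writer and poet"
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids=ids).logits

# Probability the model assigned to each actual token, given its prefix.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

for tok, lp in zip(ids[0, 1:], token_lp):
    p = lp.exp().item()
    flag = "  <- uncertain" if p < 0.01 else ""
    print(f"{tokenizer.decode([int(tok)])!r}: p={p:.4f}{flag}")
```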
20. Consistency-based methods
“If an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts”
● Example (SelfCheckGPT):
○ Let R refer to an LLM response drawn from a given prompt
○ Draw a further N stochastic LLM response samples {S¹, S², …, Sⁿ, …, Sᴺ} using the same prompt
○ For each sentence rᵢ in R and each sampled answer Sⁿ, measure the consistency using some kind of similarity/inconsistency score:
■ BERTScore
■ NLI contradiction score
■ Self-evaluation: prompt the LLM (“Is the sentence supported by the context above? Answer Yes or No”)
■ …
○ Aggregate these scores to compute the hallucination score of sentence rᵢ, H(i), such that H(i) ∈ [0.0, 1.0], where H(i) → 0.0 if the i-th sentence is grounded in valid information and H(i) → 1.0 if the i-th sentence is hallucinated
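A simplified sketch of the BERTScore variant, assuming the bert-score package. This is an approximation, not the paper's exact implementation: it compares each sentence against whole sampled responses, whereas SelfCheckGPT compares against each sample's best-matching sentence before averaging.

```python
# Simplified SelfCheckGPT-style consistency scoring with BERTScore.
# Assumes the `bert-score` package (pip install bert-score).
# `samples` are the N extra responses drawn from the same prompt.
from bert_score import score

def hallucination_scores(response_sentences: list[str],
                         samples: list[str]) -> list[float]:
    h_scores = []
    for sentence in response_sentences:
        # BERTScore F1 between the sentence and each sampled response.
        _, _, f1 = score([sentence] * len(samples), samples,
                         lang="en", verbose=False)
        # H(i) -> 0.0 when the samples consistently support the sentence,
        # H(i) -> 1.0 when none of them do.
        h_scores.append(1.0 - f1.mean().item())
    return h_scores
```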
22. Hallucination detection methods
Self-evaluation 🔍
✅ Simple and straightforward to use
✅ Useful for a variety of tasks
❌ Not very suitable for extrinsic hallucinations
Reference-based 📚
✅ Useful for text summarization, RAG
❌ Requires references to be readily available
❌ Challenging for open QA tasks/free text generation
Uncertainty-based 🤔
✅ Useful for a variety of tasks
❌ Requires access to internal model states
Consistency-based 🎯
✅ Reference-free
✅ Works for black-box LLMs
❌ Can be challenging in real-time use cases
❌ Not useful for all tasks (e.g. free text generation)
24. Sources
● Manakul, P., Liusie, A., & Gales, M. J. F. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models.
● Amatriain, X. (2024). Measuring and Mitigating Hallucinations in Large Language Models: A Multifaceted Approach.
● McKenna, N., Li, T., Cheng, L., Hosseini, M. J., Johnson, M., & Steedman, M. (2023). Sources of Hallucination by Large Language Models on Inference Tasks.
● Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2023). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.
● Liu, T., Zhang, Y., Brockett, C., Mao, Y., Sui, Z., Chen, W., & Dolan, B. (2022). A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation.
● Yuan, W., Neubig, G., & Liu, P. (2021). BARTScore: Evaluating Generated Text as Text Generation.
● Xu, Z., Jain, S., & Kankanhalli, M. (2024). Hallucination is Inevitable: An Innate Limitation of Large Language Models.
● Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Chen, D., Chan, H., Dai, W., Madotto, A., & Fung, P. (2024). Survey of Hallucination in Natural Language Generation.
● Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.