The rise of Large Language Models has revolutionized the AI landscape, unlocking huge potential across society. However, it has also introduced the challenge of hallucinations: instances where the model generates false content in a convincingly fluent way. Rest assured, Morena will guide you through an exploration of how we can automatically detect these instances of hallucination to fully unleash the potential of LLMs.
3. In the 1980s, Saddam Hussein was given the key to the city of Detroit after donating $250,000 to a local church. The church’s pastor, Jacob Yasso, calls the former Iraqi president “a very generous, warm man who just let too much power go to his head”. → True!
Justin Bieber's DNA was sent into space aboard a SpaceX Falcon 9 rocket in 2021 as part of a promotional collaboration between Bieber and a technology company. → BS
4. Hi, my name is Morena 👋
● Data Scientist at GetYourGuide
● Co-organiser of MLOps Community meetups in Berlin
7. What are hallucinations in LLMs?
● Generation of content that is
○ factually incorrect
○ nonsensical or unfaithful to the input context
● Intrinsic vs. extrinsic hallucination
○ Intrinsic hallucinations contradict the source; extrinsic ones add claims that cannot be verified from it
○ Example: LLM summarizing a Wikipedia page about Paris
■ Intrinsic: “Paris has a population of 1 million residents”
■ Extrinsic: “Paris is home to the most successful soccer team in France”
8. What are hallucinations in LLMs?
● Hallucinations can be harmful in many ways, especially when the information is hard to verify
○ Major challenge for deploying LLMs in production
○ Potential harm for society
● The nature of these models makes them output false content in a very convincing way
→ It is becoming increasingly important to detect hallucinations in a structured, quantitative way
10. Why do LLMs hallucinate?
● Contradicting or false information in the training data
● Complexity or novelty of the task to perform
● Fundamental nature of the model:
○ LLMs are trained to predict tokens probabilistically
○ Text is broken down into tokens
○ The next token is predicted based on token and position embeddings (see the sketch below)
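To make that last point concrete, here is a minimal sketch of probabilistic next-token prediction, assuming the Hugging Face transformers package and the small gpt2 checkpoint (both illustrative choices, not from the slides). The model ranks every vocabulary token by probability and continues from that distribution, regardless of whether the most likely continuation happens to be factually true.

```python
# Minimal sketch: inspect a causal LM's next-token distribution.
# Assumes the `transformers` package and the small `gpt2` model
# (illustrative choices; any causal LM works the same way).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the whole vocabulary for the next token.
probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([int(i)])!r}: {p:.3f}")
```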
14. Self-evaluation methods
● Prompt the LLM to evaluate its previous prediction
● Evaluating response quality is an easier task than producing the response
● E.g. a text summarization task (sketched in code below):
○ Prompt the LLM: “Summarize the following text: {}”
○ Break the response down into sentences
○ For each sentence, prompt the LLM: “Context: {} Sentence: {} Is the sentence supported by the context above? Answer Yes or No:”
● Can be used in combination with other methods to improve reliability
○ Combine with reference-based methods
○ Combine with consistency-based methods, using a sampling approach
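A minimal sketch of the summarization pipeline above. `call_llm` is a hypothetical placeholder for whatever chat-completion client you use; the prompts mirror the ones on the slide.

```python
# Minimal sketch of self-evaluation for a summarization task.
# `call_llm` is a hypothetical helper: it sends a prompt to any chat LLM
# and returns the text response. Wire in your provider's client here.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM provider here")

def self_evaluate_summary(source_text: str) -> list[tuple[str, bool]]:
    # Step 1: ask the LLM for a summary.
    summary = call_llm(f"Summarize the following text: {source_text}")

    # Step 2: break the response down into sentences (naive split).
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]

    # Step 3: ask the LLM whether each sentence is supported by the source.
    results = []
    for sentence in sentences:
        verdict = call_llm(
            f"Context: {source_text}\nSentence: {sentence}\n"
            "Is the sentence supported by the context above? "
            "Answer Yes or No:"
        )
        results.append((sentence, verdict.strip().lower().startswith("yes")))
    return results
```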
16. Reference-based methods
● Measure generation consistency against a provided reference
● Availability of references depends on the use case
○ Open QA: lack of references
○ RAG: references available through retrieval
○ Text summarization: references readily available
● Pipeline: break down the LLM response into sentences, then for each sentence calculate a similarity score with the reference (e.g. BERTScore or ROUGE)
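One way to implement that pipeline is sketched below using ROUGE-L via the rouge-score package; this is an illustrative choice, and BERTScore would slot in the same way.

```python
# Minimal sketch: score each response sentence against a reference
# with ROUGE-L. Assumes the `rouge-score` package
# (pip install rouge-score); an illustrative choice of metric.
from rouge_score import rouge_scorer

def sentence_scores(response_sentences: list[str],
                    reference: str) -> list[float]:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    # Higher F1 means more overlap with the reference;
    # low-scoring sentences are candidate hallucinations.
    return [
        scorer.score(reference, s)["rougeL"].fmeasure
        for s in response_sentences
    ]
```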
18. Uncertainty-based methods
● Leverage existing intrinsic uncertainty metrics to determine the parts of the output sequence that the system is least certain of
● Require access to token-level probability distributions
● Example: “Morena Bastiaansen is a Belgian writer and poet” — a hallucinated sentence whose low-probability tokens these metrics would flag
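A minimal sketch of the idea, again assuming the transformers package and gpt2 (illustrative choices): score each token of the example sentence under the model and flag tokens that were assigned a low probability given their prefix.

```python
# Minimal sketch: flag low-probability tokens in a sequence.
# Assumes the `transformers` package and `gpt2` (illustrative choices);
# the 0.01 threshold is an arbitrary assumption for demonstration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Morena Bastiaansen is a Belgian writer and poet"
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids=ids).logits

# Probability the model assigned to each actual token, given its prefix.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

for tok, lp in zip(ids[0, 1:], token_lp):
    p = lp.exp().item()
    flag = "  <- uncertain" if p < 0.01 else ""
    print(f"{tokenizer.decode([int(tok)])!r}: p={p:.4f}{flag}")
```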
20. Consistency-based methods
“If an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts”
● Example (SelfCheckGPT):
○ Let R refer to an LLM response drawn from a given prompt
○ Draw a further N stochastic LLM response samples {S¹, S², …, Sⁿ, …, Sᴺ} using the same prompt
○ For each sentence rᵢ in R and each sampled answer Sⁿ, measure the consistency using some kind of similarity/inconsistency score:
■ BERTScore
■ NLI contradiction score
■ Self-evaluation: prompt the LLM (“Is the sentence supported by the context above? Answer Yes or No”)
■ …
○ Aggregate these scores to compute the hallucination score of sentence rᵢ, H(i), such that H(i) ∈ [0.0, 1.0], where H(i) → 0.0 if the i-th sentence is grounded in valid information and H(i) → 1.0 if the i-th sentence is hallucinated
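A simplified sketch of the BERTScore variant, assuming the bert-score package. This is an approximation, not the paper's exact implementation: it compares each sentence against whole sampled responses, whereas SelfCheckGPT compares against each sample's best-matching sentence before averaging.

```python
# Simplified SelfCheckGPT-style consistency scoring with BERTScore.
# Assumes the `bert-score` package (pip install bert-score).
# `samples` are the N extra responses drawn from the same prompt.
from bert_score import score

def hallucination_scores(response_sentences: list[str],
                         samples: list[str]) -> list[float]:
    h_scores = []
    for sentence in response_sentences:
        # BERTScore F1 between the sentence and each sampled response.
        _, _, f1 = score([sentence] * len(samples), samples,
                         lang="en", verbose=False)
        # H(i) -> 0.0 when the samples consistently support the sentence,
        # H(i) -> 1.0 when none of them do.
        h_scores.append(1.0 - f1.mean().item())
    return h_scores
```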
22. Hallucination detection methods
Self-evaluation 🔍
✅ Simple and straightforward to use
✅ Useful for a variety of tasks
❌ Not very suitable for extrinsic hallucinations
Reference-based 📚
✅ Useful for text summarization, RAG
❌ Requires references to be readily available
❌ Challenging for open QA tasks/free text generation
Uncertainty-based 🤔
✅ Useful for a variety of tasks
❌ Requires access to internal model states
Consistency-based 🎯
✅ Reference-free
✅ Works for black-box LLMs
❌ Can be challenging in real-time use cases
❌ Not useful for all tasks (e.g. free text generation)
24. Sources
● Manakul, P., Liusie, A., & Gales, M. J. F. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models.
● Amatriain, X. (2024). Measuring and Mitigating Hallucinations in Large Language Models: A Multifaceted Approach.
● McKenna, N., Li, T., Cheng, L., Hosseini, M. J., Johnson, M., & Steedman, M. (2023). Sources of Hallucination by Large Language Models on Inference Tasks.
● Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2023). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.
● Liu, T., Zhang, Y., Brockett, C., Mao, Y., Sui, Z., Chen, W., & Dolan, B. (2022). A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation.
● Yuan, W., Neubig, G., & Liu, P. (2021). BARTScore: Evaluating Generated Text as Text Generation.
● Xu, Z., Jain, S., & Kankanhalli, M. (2024). Hallucination is Inevitable: An Innate Limitation of Large Language Models.
● Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Chen, D., Chan, H., Dai, W., Madotto, A., & Fung, P. (2024). Survey of Hallucination in Natural Language Generation.
● Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.