Hallucinations are a fundamental problem for current LLMs.
For one example, in June of this year in New York, attorneys did their "research" on past cases with ChatGPT and submitted the result to the judge as a brief. Opposing counsel reported to the judge that they could not find the cited cases. When the judge confronted the attorneys who had used ChatGPT, they stood behind their brief. The judge fined the firm $5,000.
Could this happen to you? YES. What can be done to avoid it in the future? This talk will answer both questions.
In this talk, I will cover some fundamentals of LLMs to explain how and why hallucinations occur. An introduction to how words, concepts, and dialogs are represented will help build that understanding.
Words were first represented as points in an embedding space with Word2Vec in 2013, which could compress a 10,000-word vocabulary into vectors of 300 elements, each word becoming a point in the 300-dimensional embedding space. Not just words can be represented: longer text, such as whole books, can be compressed into a form of embedding, where regions of the embedding space correspond to genres such as non-fiction, science fiction, children's fiction, and so on. A new data point between training data points, when converted back to text, would be a hallucination. In the "legal cases" region of embedding space, if there is no exact match, text generation will produce whatever is plausible.
During an LLM conversation, the previously generated text provides context for the next text, in the style of a recurrent neural network, so the starting position of a conversation matters. Understanding that regions of embedding space represent genres like "non-fiction" (and other aspects of language), and that the starting position of the dialog's time series matters, helps explain why prompt engineering works. The conversation is represented in the activations across the network's 7B to 500B weights, a much larger space. During a conversation, no learning occurs, but the neural network's activations keep changing. The neural network is not a database: even if you reach the exact set of weight activations for a training record, the compression is lossy, so the exact text may not be regenerated.
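As a minimal sketch of that feed-the-output-back-in loop (an illustration only, not the actual ChatGPT implementation; the model() call is a hypothetical stand-in for any next-token predictor):

```python
import numpy as np

def generate(model, prompt_tokens, n_steps):
    """Autoregressive loop: each output token becomes part of the next input."""
    tokens = list(prompt_tokens)               # the "starting position" set by the prompt
    for _ in range(n_steps):
        logits = model(tokens)                 # hypothetical: a score for every vocabulary item
        tokens.append(int(np.argmax(logits)))  # greedy choice of the most probable next token
    return tokens
```

Because each step conditions only on the tokens generated so far, a conversation that starts in the wrong region of the space keeps extending that region: the loop generates, it does not look anything up.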
ChatGPT does not use word embeddings. For implementation efficiency, it is practical to limit what is embedded to a lookup table of about 50,000 items; if we wanted to support proper nouns, like names, plus dozens of languages, the number of whole words would run into the millions. ChatGPT and other LLMs instead embed "tokens". Examples of Byte Pair Encoding (BPE) and its process are given later in the talk. The ChatGPT embedding is a vector of 1,536 numbers for each token.
A solution for today is Retrieval Augmented Generation (RAG). As a brief introduction: you ask a question in English or another natural language, and it is matched against a large library or database of paragraphs from internal documents or websites.
1. Greg Makowski
Head of Data Science Solutions
Cybernator.Net
Friday, September 29, 2023, 10:10 am
Global AI Conference
https://www.globalbigdataconference.com/virtual/global-artificial-intelligence-conference/schedule-139.html Conference Schedule
https://www.slideshare.net/gregmakowski Slides
www.LinkedIn.com/in/GregMakowski Connect on LinkedIn
Understanding Hallucinations in LLMs, and Why Retrieval Augmented Generation (RAG) Reduces the Issue
2. Greg Makowski
• Goal since high school was “Applied Science Fiction”
• Deploying Data Science and Artificial Intelligence since 1992
• Worked for American Express, then 6 startups
• Been through 4 acquisitions or startup exits
• Deployed ~96 models for clients
• 10 Enterprise AI Applications
• Growing DS teams since 2010
• Applied for 9 DS patents since Jan 2022
3. Agenda
Example Hallucination Problem
Understanding Fundamentals
• Word2Vec – early embedding
• Representing books in embedding
• Recurrent NNs feed outputs back into the next time step
• Byte Pair Encoding (about 4 char) embedding for GPT
• Hallucination is a point between training data points
Solutions
• Use LLMs for reasoning, not a corporate knowledge base
• Retrieval Augmented Generation (RAG, for the short term)
• Dr. Yann LeCun’s Objective-Driven AI (long-term)
4. Q) How could this happen?
Q) Is this limited to AI in the Law? NO
Q) Why don’t AI people just use factual training data? Would that fix it? NO
Q) What if I used another LLM – would it still be an issue? YES
Q) Could it happen to you, in your company? YES
Q) Will this talk help you avoid this? YES !!
Example Hallucination Problem
New York Lawyers Sanctioned for Using Fake ChatGPT Cases in Legal Brief
https://www.cnbc.com/2023/06/22/judge-sanctions-lawyers-whose-ai-written-filing-contained-fake-citations.html
Judge P. Kevin Castel said that the attorneys, Peter LoDuca and Steven Schwartz (pictured), “abandoned their responsibilities” when they submitted the A.I.-written brief in their client’s lawsuit… in March, and “then continued to stand by the fake opinions after judicial orders called their existence into question”
5. Understanding Fundamentals
Word2Vec – an early embedding
• Before, one word was represented as one column, with a 0/1 value
• Used to represent a sentence or paragraph
• A vocabulary of 10k or 50k words determines the number of inputs to a model
• Problems
• Order of words does not matter
• Same representation for “bank” in “river bank”, “plane banks left”, and “financial bank”
• After, the advantages of Word2Vec
• May “compress” 10k sparse columns to 300 columns, with words as “points”
• Based on the context of words around “bank”, it discovers the different meanings
• Once created, save as a lookup table: “word x” → “embedding vector x”
• Embedding spaces are used for
• Words, paragraphs, face recognition, speaker recognition, social networks
https://arxiv.org/pdf/1310.4546.pdf
Distributed Representations of Words and Phrases and their Compositionality – 2013
https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1
https://github.com/tmikolov/word2vec https://github.com/loretoparisi/word2vec
[Diagram: the sentence “The plane banks left before …” as sparse one-hot inputs; 10k context inputs compressed to 300 embedding dimensions]
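To make “words as points” concrete, here is a minimal sketch using the gensim library and its pretrained 300-dimensional Google News Word2Vec vectors (an illustrative choice; the slide does not prescribe a library, and the model is a large download):

```python
import gensim.downloader as api  # pip install gensim

model = api.load("word2vec-google-news-300")  # pretrained 300-dimensional vectors

print(model["bank"].shape)                    # (300,): one word, one point in 300-d space
print(model.most_similar("bank", topn=5))     # neighbors that appear in similar contexts
```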
6. Embedding of Books – to encode concepts
• In the chart on the right, book embeddings
• Non-Fiction – top right, green
• Science Fiction – left, blue
• Fiction – Orange, middle to lower right
• Starting state is set up by LLM PROMPT ENGINEERING
• This sets the “starting place” from which to run the time-series dialog with the LLM, i.e. start:
• Non-fiction
• Respectful
https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526
t-Distributed Stochastic Neighbor Embedding (t-SNE)
[Chart annotation: the non-fiction region; the next slide drills down into it]
7. New Data Points in Embedding Space Cause New Text to Be Generated (Hallucinations)
• HALLUCINATIONS
• New LLM conversations land between existing training points, which may all be factual
• The LLM is not 300 dimensions, but 7B to 500B+
• A data point at a new location in N dimensions is some interpolation of the surrounding concepts, i.e. the neighboring COMPRESSED training data points (such as existing legal cases)
• NOT a web search
• NOT information retrieval!
• IT IS Text generation
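A toy illustration of that interpolation idea, with hand-made 3-element vectors standing in for billions of dimensions (hypothetical case names and numbers, purely for intuition):

```python
import numpy as np

# Hand-made "training" embeddings for two real legal cases (hypothetical values)
cases = {
    "Smith v. Jones (real)": np.array([0.9, 0.1, 0.0]),
    "Doe v. Acme (real)":    np.array([0.7, 0.3, 0.1]),
}

# A new conversation lands BETWEEN the two training points
query = 0.5 * (cases["Smith v. Jones (real)"] + cases["Doe v. Acme (real)"])

# Decoding interpolates from the nearest compressed neighbors: the output is
# plausible, but no real training record ever sat at this point in the space.
nearest = min(cases, key=lambda name: np.linalg.norm(cases[name] - query))
print("Generates text shaped like:", nearest)
```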
8. ChatGPT
GPT = Generative Pre-trained Transformer
Generation is a type of Hallucination
9. Embedding: close neighbors can be close concepts
• Local, very close neighbors in the embedding space have related meanings and are used in similar conversations
• In word embeddings, a word is a “point” in a 300-dimensional embedding space
• In an LLM, the weight activations for an answer paragraph are a “point” in the much larger activation space of a 70B-weight neural network
• Legal cases group close together, and have a similar format and structure (prosecution, defense, judgment)
• A new HALLUCINATED case in this embedding space will have a similar structure, but may have new person names in the generated case. This could be helpful if you are writing a legal thriller.
• EVEN IF the model was trained only on factual legal cases (excluding legal-thriller fiction)
https://serokell.io/blog/word2vec
Going from embedding concept points to NEW IMAGES is good and creative. This is NOT used to retrieve exact images.
10. How an LLM is a time series (Recurrent NN)
• Good at predicting the most popular or frequent sequences
• Only as good as the volume and variety of the training data
• Diagram shows that, given the “start state” where the prior word is “I”, the next word is most frequently “went”
• PROMPT ENGINEERING in current LLMs sets this start state
https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77
Hallucinations are just “next probable text”
temperature = 0: choose the max-probability token. Use for the “most repeatable results”.
temperature = 0.8 or 1.0: choose a weighted random next likely item in the sequence. Use for “more creativity”.
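A minimal sketch of what those two temperature settings mean mechanically (hypothetical logits, not OpenAI's actual decoding code):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng()):
    """Pick the next token id from raw model scores ("logits")."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))            # greedy: most repeatable results
    scaled = logits / temperature                # higher temperature flattens the distribution
    probs = np.exp(scaled - scaled.max())        # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))  # weighted random pick: more creativity

# Hypothetical scores for 4 candidate next tokens
print(sample_next_token([2.0, 1.0, 0.5, 0.1], temperature=0))    # always token 0
print(sample_next_token([2.0, 1.0, 0.5, 0.1], temperature=0.8))  # may pick a less likely token
```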
11. What does ChatGPT use for embeddings?
• Input text is broken into small chunks or “tokens”: a word, or parts of a word, combined in the LLM sequence
• A conversation or “context length” may be 2k, 4k, 32k or 100k (Anthropic’s Claude 2)
• https://platform.openai.com/tokenizer
• 1,536 dimensions in embedding space
q) Is this a good representation of numbers for math?
• May be an average of 4 characters or ¾ of a word
• Represents a lookup table of ~50,000 tokens
• Tokens are combined together to make words
• """what <about> delimiters?"""
• How to represent Spanish?
• Byte Pair Encoding (BPE) https://en.wikipedia.org/wiki/Byte_pair_encoding
• Combine letters that frequently occur next to each other, to determine what tokens are used to create input embeddings
• Space between words connects to the beginning of the next word
• Keep “compressing” the “training sample text” until you end up with N (e.g. 50k) tokens. A worked example (implemented in the sketch below):
• aaabdaaabac (input sample text). See many pairs of “aa”. Replace them with a code letter: “aa” → “Z”
• ZabdZabac. Observe frequent “ab” pairs. Now use “ab” → “Y”
• ZYdZYac. Observe frequent “ZY”. Now use “ZY” → “X”
• XdXac. This is how BPE figures out which tokens to encode in an embedding space.
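A toy implementation of that merge loop (a sketch of the BPE idea, not OpenAI's production tokenizer; tie-breaking between equally frequent pairs may differ from the slide's intermediate steps, but the final result matches):

```python
from collections import Counter

def bpe_compress(text, num_merges):
    """Repeatedly replace the most frequent adjacent symbol pair with a new code symbol."""
    seq = list(text)
    codes = iter("ZYXWVU")                      # replacement symbols, as on the slide
    merges = {}
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))      # count adjacent pairs
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                               # nothing left worth merging
        code = next(codes)
        merges[code] = a + b
        out, i = [], 0                          # rewrite the sequence with the merged symbol
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(code); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return "".join(seq), merges

print(bpe_compress("aaabdaaabac", 3))  # -> ('XdXac', ...) as on the slide
```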
• ¿Cómo representar al español? (“How to represent Spanish?”)
q) Why use these groupings of letters to define a token?
Ans) See BPE
Ans) It scales to a large, multi-lingual vocabulary and proper nouns (names)
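To see real GPT-style tokens (including how the Spanish question above gets split), here is a minimal sketch assuming the tiktoken package is installed; its “gpt2” encoding has the ~50,000-token lookup table described above:

```python
import tiktoken  # OpenAI's open-source tokenizer library: pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # BPE vocabulary of ~50k tokens
text = "¿Cómo representar al español?"
ids = enc.encode(text)

print(ids)                                              # indexes into the token lookup table
print([enc.decode_single_token_bytes(i) for i in ids])  # the bytes each token covers
print(f"{len(text) / len(ids):.1f} characters per token on average")
```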
12. Solutions to Hallucinations
• Use LLMs for reasoning, NOT a corporate knowledge base
• Good for the most common knowledge (“Tuesday follows Monday”) and the “head of the long tail”
• LLMs are NOT good for
• detecting their own Hallucinations
• data changing quarterly or daily (LLMs are static)
• long-tail, very detailed knowledge that passes regression testing
• for a specific company and vertical application,
• can try Supervised Fine-Tuning (SFT) with Quantized Low-Rank Adaptation (Q-LoRA)
• Retrieval Augmented Generation (RAG) (today’s solution)
• Objective-Driven AI, by Yann LeCun (a better solution in the future)
13. Retrieval Augmented Generation (RAG)
• Benefits
• LLM application can query your internal unstructured data (web, docs, …) or structured data (SQL, Snowflake, …)
• As your data updates from one day to the next, the LLM query results will access the updated data
• Gives answer citations! Therefore NOT a hallucination. The reader can investigate further.
• All the “data quality control” your organization has will stay in place
• Once you “connect” the LLM application, you don’t have to repeat an expensive SFT or training update every day or week
• To query unstructured data, use an embedding DB:
• Setup
• Break your text into paragraphs or chunks, 500-1000 characters. Text chunks may overlap
• Add to the chunk any questions the chunk answers, for better matching to queries
• Encode with an embedding, save in the EMBEDDING DATABASE
• May save with structured attributes, to narrow down queries
• Query time
• The LLM application takes the user's text or query and converts it to a query embedding: q) [.02, .06, … .72]
• Use the query embedding to find the best match among document embeddings (closest Euclidean distance): b) [.03, .05, … .65]
• Element by element, q) and b) are both low, low, …, high; hence the close match (a code sketch follows after this slide's notes)
• Hands-on training, using LangChain and ChatGPT with Python
• SF Bay ACM has an upcoming class, Sat, Nov 4: Building Enterprise LLM Applications
Vector DB search over all dimensions at once:
a) [.43, .01, … .04]
b) [.03, .05, … .65] ← the closest match to the query q)
c) [.01, .42, … .02]
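A minimal sketch of the setup and query steps listed above. The embed() function here is a toy stand-in (a deterministic random projection) for a real embedding model, such as one returning a 1,536-dimension vector; no specific vendor API is implied:

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: a deterministic random projection."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# --- Setup: chunk documents, embed once, store in the "embedding database" ---
chunks = ["...paragraph 1 of an internal doc...",
          "...paragraph 2...",
          "...paragraph 3..."]
chunk_vecs = np.stack([embed(c) for c in chunks])

# --- Query time: embed the question, return the nearest chunks as context + citations ---
def retrieve(query: str, k: int = 2):
    q = embed(query)
    dists = np.linalg.norm(chunk_vecs - q, axis=1)       # Euclidean distance, as on the slide
    best = np.argsort(dists)[:k]
    return [(chunks[i], float(dists[i])) for i in best]  # hand these to the LLM as context

print(retrieve("a user question in plain English"))
```

With a real embedding model, the nearest chunks are semantically related to the question, and the LLM answers from retrieved text it can cite, rather than generating from interpolated training points.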
14. Retrieval Augmented Generation (RAG)
https://arxiv.org/abs/2305.06983 Active Retrieval Augmented Generation – 2023 May
15. Traditional SQL index
• On “last_name + first_name”
• Binary tree, B+ tree
Embedding database index
• On all 300 or 1,536 fields at once, independent of order
• Hash functions and other technologies
Reading
• “Semantic Search with Embeddings: Index anything” by Romain Beaumont
https://rom1504.medium.com/semantic-search-with-embeddings-index-anything-8fb18556443c
RAG: Embedding Database Vendors
https://www.graft.com/blog/top-vector-databases-for-ai-projects
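As one concrete example of such an index (FAISS is an illustrative choice here, one of many vector index libraries; the slide does not endorse a specific vendor):

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 300                                             # e.g. Word2Vec-sized embeddings
vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in document embeddings

index = faiss.IndexFlatL2(dim)  # exact L2 (Euclidean) search over all 300 fields at once
index.add(vecs)

query = np.random.rand(1, dim).astype("float32")
dists, ids = index.search(query, 5)                   # the 5 nearest document vectors
print(ids[0], dists[0])
```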
16. Objective-Driven AI
“Objective-Driven AI, Towards AI Systems that can learn, remember, reason, plan, have common sense,
yet are steerable and safe”
https://drive.google.com/file/d/1wzHohvoSgKGZvzOWqZybjm4M4veKR6t3/view
Yann LeCun, 2023-07-21, New York University and Meta – Fundamental AI Research
17. Objective-Driven AI (continued)
18. Greg Makowski
Head of Data Science Solutions
Cybernator.Net
Friday, September 29, 2023, 10:10 am
Global AI Conference
https://www.globalbigdataconference.com/virtual/global-artificial-intelligence-conference/schedule-139.html Conference Schedule
https://www.slideshare.net/gregmakowski Slides
www.LinkedIn.com/in/GregMakowski Connect on LinkedIn
QUESTIONS?
Understanding Hallucinations in LLMs, and Why Retrieval Augmented Generation (RAG) Reduces the Issue
“Building Enterprise LLM Applications”, a full-day class, Sat, Nov 4th
Use code GLOBAL20 for 20% off
Through the local ACM chapter (non-profit)