Hallucinations are a fundamental problem for current LLMs.
For one example, in June of this year in New York, attorneys did their "research" on past cases with ChatGPT and submitted the result to the judge as a brief. Opposing counsel reported to the judge that they could not find the cited cases. When the judge confronted the attorneys who had used ChatGPT, they stood behind their brief. The judge fined the firm $5,000.
Could this happen to you? YES. What can be done to avoid it in the future? This talk will answer both questions.
In this talk, I will cover some fundamentals of LLMs to explain how and why hallucinations occur. An introduction to how words, concepts, and dialogs are represented will help build that understanding.
Words were first represented as points in an embedding space with Word2Vec in 2013, which could compress a 10,000-word vocabulary into vectors of 300 elements, each word becoming a point in the 300-dimensional embedding space. Not just words can be represented: longer text, such as whole books, can be compressed into a form of embedding, where regions of the embedding space correspond to genres such as non-fiction, science fiction, children's fiction, and so on. A new data point between training data points, when converted back to text, would be a hallucination. In the "legal cases" region of embedding space, if there is no exact match, text generation will produce whatever is plausible.
During an LLM conversation, the previously generated text provides context for the next text, in the style of a recurrent neural network, so the starting position of a conversation matters. Understanding that regions of embedding space represent genres like "non-fiction" (and other aspects of language), and that the starting position of the dialog's time series matters, helps explain why prompt engineering works. The conversation is represented in the activations across the network's 7B to 500B weights, a much larger space. During a conversation, no learning occurs, but the neural network's activations keep changing. The neural network is not a database: even if you reach the exact set of weight activations for a training record, the compression is lossy, so the exact text may not be regenerated.
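As a minimal sketch of that feed-the-output-back-in loop (an illustration only, not the actual ChatGPT implementation; the model() call is a hypothetical stand-in for any next-token predictor):

```python
import numpy as np

def generate(model, prompt_tokens, n_steps):
    """Autoregressive loop: each output token becomes part of the next input."""
    tokens = list(prompt_tokens)               # the "starting position" set by the prompt
    for _ in range(n_steps):
        logits = model(tokens)                 # hypothetical: a score for every vocabulary item
        tokens.append(int(np.argmax(logits)))  # greedy choice of the most probable next token
    return tokens
```

Because each step conditions only on the tokens generated so far, a conversation that starts in the wrong region of the space keeps extending that region: the loop generates, it does not look anything up.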
ChatGPT does not use word embeddings. For implementation efficiency, it is practical to limit what is embedded to a lookup table of about 50,000 items; if we wanted to support proper nouns, like names, plus dozens of languages, the number of whole words would run into the millions. ChatGPT and other LLMs instead embed "tokens". Examples of Byte Pair Encoding (BPE) and its process are given later in the talk. The ChatGPT embedding is a vector of 1,536 numbers for each token.
A solution for today is Retrieval Augmented Generation (RAG). As a brief introduction: you ask a question in English or another natural language, and it is matched against a large library or database of paragraphs from internal documents or websites.
1. Greg Makowski
Head of Data Science Solutions
Cybernator.Net
Friday, September 29, 2023, 10:10 am
Global AI Conference
https://www.globalbigdataconference.com/virtual/global-artificial-intelligence-conference/schedule-139.html Conference Schedule
https://www.slideshare.net/gregmakowski Slides
www.LinkedIn.com/in/GregMakowski Connect on LinkedIn
Understanding Hallucinations in LLMs, and Why Retrieval Augmented Generation (RAG) Reduces the Issue
2. Greg Makowski
• Goal since high school was “Applied Science Fiction”
• Deploying Data Science and Artificial Intelligence since 1992
• Worked for American Express, then 6 startups
• Been through 4 acquisitions or startup exits
• Deployed ~96 models for clients
• 10 Enterprise AI Applications
• Growing DS teams since 2010
• Applied for 9 DS patents since Jan 2022
3. Agenda
Example Hallucination Problem
Understanding Fundamentals
• Word2Vec – early embedding
• Representing books in embedding
• Recurrent NNs feed outputs back into the next time step
• Byte Pair Encoding (about 4 char) embedding for GPT
• Hallucination is a point between training data points
Solutions
• Use LLMs for reasoning, not a corporate knowledge base
• Retrieval Augmented Generation (RAG, for the short term)
• Dr. Yann LeCun’s Objective-Driven AI (long-term)
4. Q) How could this happen?
Q) Is this limited to AI in the Law? NO
Q) Why don’t AI people just use factual training data? Would that fix it? NO
Q) What if I used another LLM – would it still be an issue? YES
Q) Could it happen to you, in your company? YES
Q) Will this talk help you avoid this? YES !!
Example Hallucination Problem
New York Lawyers Sanctioned for Using Fake ChatGPT Cases in Legal Brief
https://www.cnbc.com/2023/06/22/judge-sanctions-lawyers-whose-ai-written-filing-contained-fake-citations.html
Judge P. Kevin Castel said that the attorneys, Peter LoDuca and Steven Schwartz (pictured), “abandoned their responsibilities” when they submitted the A.I.-written brief in their client’s lawsuit… in March, and “then continued to stand by the fake opinions after judicial orders called their existence into question”
5. Understanding Fundamentals
Word2Vec – an early embedding
• Before, one word was represented as one column, with a 0/1 value
• Used to represent a sentence or paragraph
• A vocabulary of 10k or 50k words determines the number of inputs to a model
• Problems
• Order of words does not matter
• Same representation for “bank” in “river bank”, “plane banks left”, and “financial bank”
• After, the advantages of Word2Vec
• May “compress” 10k sparse columns to 300 columns, with words as “points”
• Based on the context of words around “bank”, it discovers the different meanings
• Once created, save as a lookup table: “word x” → “embedding vector x”
• Embedding spaces are used for
• Words, paragraphs, face recognition, speaker recognition, social networks
https://arxiv.org/pdf/1310.4546.pdf
Distributed Representations of Words and Phrases and their Compositionality – 2013
https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1
https://github.com/tmikolov/word2vec https://github.com/loretoparisi/word2vec
[Diagram: the sentence “The plane banks left before …” as sparse one-hot inputs; 10k context inputs compressed to 300 embedding dimensions]
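To make “words as points” concrete, here is a minimal sketch using the gensim library and its pretrained 300-dimensional Google News Word2Vec vectors (an illustrative choice; the slide does not prescribe a library, and the model is a large download):

```python
import gensim.downloader as api  # pip install gensim

model = api.load("word2vec-google-news-300")  # pretrained 300-dimensional vectors

print(model["bank"].shape)                    # (300,): one word, one point in 300-d space
print(model.most_similar("bank", topn=5))     # neighbors that appear in similar contexts
```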
6. Embedding of Books – to encode concepts
• In the chart on the right, book embeddings
• Non-Fiction – top right, green
• Science Fiction – left, blue
• Fiction – Orange, middle to lower right
• Starting state is set up by LLM PROMPT ENGINEERING
• This sets the “starting place” from which to run the time-series dialog with the LLM, i.e. start:
• Non-fiction
• Respectful
https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526
t-Distributed Stochastic Neighbor Embedding (t-SNE)
[Chart annotation: the non-fiction region; the next slide drills down into it]
7. New Data Points in Embedding Space Cause New Text to Be Generated (Hallucinations)
• HALLUCINATIONS
• New LLM conversations land between existing training points, which may all be factual
• The LLM is not 300 dimensions, but 7B to 500B+
• A data point at a new location in N dimensions is some interpolation of the surrounding concepts, i.e. the neighboring COMPRESSED training data points (such as existing legal cases)
• NOT a web search
• NOT information retrieval!
• IT IS Text generation
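A toy illustration of that interpolation idea, with hand-made 3-element vectors standing in for billions of dimensions (hypothetical case names and numbers, purely for intuition):

```python
import numpy as np

# Hand-made "training" embeddings for two real legal cases (hypothetical values)
cases = {
    "Smith v. Jones (real)": np.array([0.9, 0.1, 0.0]),
    "Doe v. Acme (real)":    np.array([0.7, 0.3, 0.1]),
}

# A new conversation lands BETWEEN the two training points
query = 0.5 * (cases["Smith v. Jones (real)"] + cases["Doe v. Acme (real)"])

# Decoding interpolates from the nearest compressed neighbors: the output is
# plausible, but no real training record ever sat at this point in the space.
nearest = min(cases, key=lambda name: np.linalg.norm(cases[name] - query))
print("Generates text shaped like:", nearest)
```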
8. ChatGPT
GPT = Generative Pre-trained Transformer
Generation is a type of Hallucination
9. Embedding: close neighbors can be close concepts
• Local, very close neighbors in the embedding space have related meanings and are used in similar conversations
• In word embeddings, a word is a “point” in a 300-dimensional embedding space
• In an LLM, the weight activations for an answer paragraph are a “point” in the much larger activation space of a 70B-weight neural network
• Legal cases group close together, and have a similar format and structure (prosecution, defense, judgment)
• A new HALLUCINATED case in this embedding space will have a similar structure, but may have new person names in the generated case. This could be helpful if you are writing a legal thriller.
• EVEN IF the model was trained only on factual legal cases (excluding legal-thriller fiction)
https://serokell.io/blog/word2vec
Going from embedding concept points to NEW IMAGES is good and creative. This is NOT used to retrieve exact images.
10. How an LLM is a time series (Recurrent NN)
• Good at predicting the most popular or frequent sequences
• Only as good as the volume and variety of the training data
• Diagram shows that, given the “start state” where the prior word is “I”, the next word is most frequently “went”
• PROMPT ENGINEERING in current LLMs sets this start state
https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77
Hallucinations are just “next probable text”
temperature = 0: choose the max-probability token. Use for the “most repeatable results”.
temperature = 0.8 or 1.0: choose a weighted random next likely item in the sequence. Use for “more creativity”.
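A minimal sketch of what those two temperature settings mean mechanically (hypothetical logits, not OpenAI's actual decoding code):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng()):
    """Pick the next token id from raw model scores ("logits")."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))            # greedy: most repeatable results
    scaled = logits / temperature                # higher temperature flattens the distribution
    probs = np.exp(scaled - scaled.max())        # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))  # weighted random pick: more creativity

# Hypothetical scores for 4 candidate next tokens
print(sample_next_token([2.0, 1.0, 0.5, 0.1], temperature=0))    # always token 0
print(sample_next_token([2.0, 1.0, 0.5, 0.1], temperature=0.8))  # may pick a less likely token
```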
11. What does ChatGPT use for embeddings?
• Input text is broken into small chunks or “tokens”: a word, or parts of a word, combined in the LLM sequence
• A conversation or “context length” may be 2k, 4k, 32k or 100k (Anthropic’s Claude 2)
• https://platform.openai.com/tokenizer
• 1,536 dimensions in embedding space
q) Is this a good representation of numbers for math?
• May be an average of 4 characters or ¾ of a word
• Represents a lookup table of ~50,000 tokens
• Tokens are combined together to make words
• """what <about> delimiters?"""
• How to represent Spanish?
• Byte Pair Encoding (BPE) https://en.wikipedia.org/wiki/Byte_pair_encoding
• Combine letters that frequently occur next to each other, to determine what tokens are used to create input embeddings
• Space between words connects to the beginning of the next word
• Keep “compressing” the “training sample text” until you end up with N (e.g. 50k) tokens. A worked example (implemented in the sketch below):
• aaabdaaabac (input sample text). See many pairs of “aa”. Replace them with a code letter: “aa” → “Z”
• ZabdZabac. Observe frequent “ab” pairs. Now use “ab” → “Y”
• ZYdZYac. Observe frequent “ZY”. Now use “ZY” → “X”
• XdXac. This is how BPE figures out which tokens to encode in an embedding space.
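A toy implementation of that merge loop (a sketch of the BPE idea, not OpenAI's production tokenizer; tie-breaking between equally frequent pairs may differ from the slide's intermediate steps, but the final result matches):

```python
from collections import Counter

def bpe_compress(text, num_merges):
    """Repeatedly replace the most frequent adjacent symbol pair with a new code symbol."""
    seq = list(text)
    codes = iter("ZYXWVU")                      # replacement symbols, as on the slide
    merges = {}
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))      # count adjacent pairs
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                               # nothing left worth merging
        code = next(codes)
        merges[code] = a + b
        out, i = [], 0                          # rewrite the sequence with the merged symbol
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(code); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return "".join(seq), merges

print(bpe_compress("aaabdaaabac", 3))  # -> ('XdXac', ...) as on the slide
```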
• ¿Cómo representar al español? (“How to represent Spanish?”)
q) Why use these groupings of letters to define a token?
Ans) See BPE
Ans) It scales to a large, multi-lingual vocabulary and proper nouns (names)
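To see real GPT-style tokens (including how the Spanish question above gets split), here is a minimal sketch assuming the tiktoken package is installed; its “gpt2” encoding has the ~50,000-token lookup table described above:

```python
import tiktoken  # OpenAI's open-source tokenizer library: pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # BPE vocabulary of ~50k tokens
text = "¿Cómo representar al español?"
ids = enc.encode(text)

print(ids)                                              # indexes into the token lookup table
print([enc.decode_single_token_bytes(i) for i in ids])  # the bytes each token covers
print(f"{len(text) / len(ids):.1f} characters per token on average")
```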
12. Solutions to Hallucinations
• Use LLMs for reasoning, NOT a corporate knowledge base
• Good for the most common knowledge (“Tuesday follows Monday”) and the “head of the long tail”
• LLMs are NOT good for
• detecting their own Hallucinations
• data changing quarterly or daily (LLMs are static)
• long-tail, very detailed knowledge that passes regression testing
• for a specific company and vertical application,
• can try Supervised Fine-Tuning (SFT) with Quantized Low-Rank Adaptation (Q-LoRA)
• Retrieval Augmented Generation (RAG) (today’s solution)
• Objective-Driven AI, by Yann LeCun (a better solution in the future)
13. Retrieval Augmented Generation (RAG)
• Benefits
• LLM application can query your internal unstructured data (web, docs, …) or structured data (SQL, Snowflake, …)
• As your data updates from one day to the next, the LLM query results will access the updated data
• Gives answer citations! Therefore NOT a hallucination. The reader can investigate further.
• All the “data quality control” your organization has will stay in place
• Once you “connect” the LLM application, you don’t have to repeat an expensive SFT or training update every day or week
• To query unstructured data, use an embedding DB:
• Setup
• Break your text into paragraphs or chunks, 500-1000 characters. Text chunks may overlap
• Add to the chunk any questions the chunk answers, for better matching to queries
• Encode with an embedding, save in the EMBEDDING DATABASE
• May save with structured attributes, to narrow down queries
• Query time
• The LLM application takes the user's text or query and converts it to a query embedding: q) [.02, .06, … .72]
• Use the query embedding to find the best match among document embeddings (closest Euclidean distance): b) [.03, .05, … .65]
• Element by element, q) and b) are both low, low, …, high; hence the close match (a code sketch follows after this slide's notes)
• Hands-on training, using LangChain and ChatGPT with Python
• SF Bay ACM has an upcoming class, Sat, Nov 4: Building Enterprise LLM Applications
Vector DB search over all dimensions at once:
a) [.43, .01, … .04]
b) [.03, .05, … .65] ← the closest match to the query q)
c) [.01, .42, … .02]
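A minimal sketch of the setup and query steps listed above. The embed() function here is a toy stand-in (a deterministic random projection) for a real embedding model, such as one returning a 1,536-dimension vector; no specific vendor API is implied:

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: a deterministic random projection."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# --- Setup: chunk documents, embed once, store in the "embedding database" ---
chunks = ["...paragraph 1 of an internal doc...",
          "...paragraph 2...",
          "...paragraph 3..."]
chunk_vecs = np.stack([embed(c) for c in chunks])

# --- Query time: embed the question, return the nearest chunks as context + citations ---
def retrieve(query: str, k: int = 2):
    q = embed(query)
    dists = np.linalg.norm(chunk_vecs - q, axis=1)       # Euclidean distance, as on the slide
    best = np.argsort(dists)[:k]
    return [(chunks[i], float(dists[i])) for i in best]  # hand these to the LLM as context

print(retrieve("a user question in plain English"))
```

With a real embedding model, the nearest chunks are semantically related to the question, and the LLM answers from retrieved text it can cite, rather than generating from interpolated training points.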
14. Retrieval Augmented Generation (RAG)
https://arxiv.org/abs/2305.06983 Active Retrieval Augmented Generation – 2023 May
15. Traditional SQL index
• On “last_name + first_name”
• Binary tree, B+ tree
Embedding database index
• On all 300 or 1,536 fields at once, independent of order
• Hash functions and other technologies
Reading
• “Semantic Search with Embeddings: Index anything” by Romain Beaumont
https://rom1504.medium.com/semantic-search-with-embeddings-index-anything-8fb18556443c
RAG: Embedding Database Vendors
https://www.graft.com/blog/top-vector-databases-for-ai-projects
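As one concrete example of such an index (FAISS is an illustrative choice here, one of many vector index libraries; the slide does not endorse a specific vendor):

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 300                                             # e.g. Word2Vec-sized embeddings
vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in document embeddings

index = faiss.IndexFlatL2(dim)  # exact L2 (Euclidean) search over all 300 fields at once
index.add(vecs)

query = np.random.rand(1, dim).astype("float32")
dists, ids = index.search(query, 5)                   # the 5 nearest document vectors
print(ids[0], dists[0])
```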
16. Objective-Driven AI
“Objective-Driven AI, Towards AI Systems that can learn, remember, reason, plan, have common sense,
yet are steerable and safe”
https://drive.google.com/file/d/1wzHohvoSgKGZvzOWqZybjm4M4veKR6t3/view
Yann LeCun, 2023-07-21, New York University and Meta – Fundamental AI Research
17. Objective-Driven AI (continued)
18. Greg Makowski
Head of Data Science Solutions
Cybernator.Net
Friday, September 29, 2023, 10:10 am
Global AI Conference
https://www.globalbigdataconference.com/virtual/global-artificial-intelligence-conference/schedule-139.html Conference Schedule
https://www.slideshare.net/gregmakowski Slides
www.LinkedIn.com/in/GregMakowski Connect on LinkedIn
QUESTIONS?
Understanding Hallucinations in LLMs, and Why Retrieval Augmented Generation (RAG) Reduces the Issue
“Building Enterprise LLM Applications”, a full-day class, Sat, Nov 4th
Use code GLOBAL20 for 20% off
Through the local ACM chapter (non-profit)