Practical GenAI: Understanding Large Language Models (LLMs)
10 Limitations of LLMs and mitigation options
Mihai Criveti, Principal Architect, CKA, RHCA III
September 10, 2023
1. Hallucinations
2. Performance
3. Inference Cost
4. Stale training data
5. Use with private data
6. Token limits / context window size
7. LLMs only support plain text
8. Lack of transparency / explainability
9. Ethical Concerns
10. Training and fine tuning costs
Introduction
Mihai Criveti, Principal Architect, Platform Engineering
• Responsible for large scale Cloud Native and AI Solutions
• Red Hat Certified Architect III, CKA/CKS/CKAD
• Drives the development of Inner Source Retrieval Augmented Generation platforms and solutions for
Generative AI at IBM that leverage WatsonX, vector databases, LangChain, HuggingFace and open source AI
models.
Abstract
10 Limitations of Large Language Models and ways to overcome them. Dealing with hallucinations, performance,
costs, stale training data, injecting private data, token limits and contextual memory, text conversion, lack of
transparency, ethical concerns and training costs.
1. Hallucinations
Because models are designed to produce coherent and fluent text, LLMs can ‘hallucinate’ and generate text that is
incorrect, but often seems plausible.
A lack of context, or of contextual understanding of the input prompt, is a key reason why LLMs hallucinate.
Why hallucinations occur
Lack of context or contextual understanding
• The input prompt is contradictory, or unclear
• The prompt does not provide sufficient examples of the desired output
• The model lacks the context to respond to the input, either in its dataset or in the prompt
Data Quality and Training Method
• The model itself has been trained on biased, noisy, old, low-quality or incorrect data
• For example, training data scraped from Twitter or various forums often contains large amounts of incorrect
information
Generation Method
• Models and their weights might be biased towards specific languages, words or data
Hallucination Workarounds
Workarounds include advanced prompt engineering
• Adding a prompt such as: If a question does not make any sense, or is not factually
coherent, explain why instead of answering something not correct.
• Provide examples using one-shot prompting or few-shot prompting (see the sketch after this list)
And forms of Retrieval Augmented Generation
• Context injection and grounding to use-case-specific sources
• More advanced methods such as Retrieval Augmented Generation using a Vector Database
• Internet or API retrieval connectors and ‘plugins’
Other workarounds
• Using a model that performs better at the given task, or fine-tuning the model
• Testing the quality of responses, and providing an alternative model / answer
• Reinforcement learning from human feedback (RLHF).
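To make the few-shot prompting bullet above concrete, here is a minimal sketch in plain Python; the generate() call at the end is a hypothetical stand-in for whichever model client you actually use:

# A minimal few-shot prompt: a couple of worked examples show the model the
# expected format and reduce the room for invented or off-topic answers.
EXAMPLES = [
    ("Classify the sentiment: 'The battery lasts two full days.'", "positive"),
    ("Classify the sentiment: 'The screen cracked within a week.'", "negative"),
]

def build_few_shot_prompt(question: str) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXAMPLES)
    return f"{shots}\n\nQ: {question}\nA:"

prompt = build_few_shot_prompt("Classify the sentiment: 'Support never replied to me.'")
# answer = generate(prompt)  # hypothetical call to your model of choice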
Hallucination Workarounds: Prompting
LLAMA2 Prompt
“You are a helpful, respectful and honest assistant. Always answer as helpfully as
possible, while being safe. Your answers should not include any harmful, unethical,
racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses
are socially unbiased and positive in nature. If a question does not make any sense, or
is not factually coherent, explain why instead of answering something not correct. If
you don't know the answer to a question, please don't share false information.”
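A minimal sketch of how such a system prompt is typically wrapped in the LLAMA2 chat template (single turn; exact special-token handling depends on your tokenizer and serving stack):

# Wrap a system prompt and a user message in the Llama 2 chat format.
# Note: the BOS token (<s>) is usually added by the tokenizer, so it is
# omitted from the string here.
SYSTEM_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "If a question does not make any sense, or is not factually coherent, "
    "explain why instead of answering something not correct. "
    "If you don't know the answer to a question, please don't share false information."
)

def llama2_prompt(user_message: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"

print(llama2_prompt("Who won the 2030 World Cup?"))
# A well-behaved model should explain that it cannot know this,
# rather than inventing a result.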
Hallucination Workarounds: Retrieval Augmented Generation
Figure 1: RAG
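The flow in Figure 1 can be sketched in a few lines. This is only a toy: word overlap stands in for embedding similarity, and in practice you would use a real embedding model and a vector database; the generate() call is again a hypothetical stand-in:

# Toy Retrieval Augmented Generation loop: retrieve the most relevant chunks,
# then inject them into the prompt as grounding context.
DOCUMENTS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm GMT.",
    "Premium subscribers get priority email support.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(f"- {c}" for c in retrieve(question, DOCUMENTS))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_rag_prompt("How long do I have to return a product?"))
# answer = generate(build_rag_prompt(...))  # hypothetical model call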
2. Performance
Concerns
• Even the fastest models are slower than a dial-up modem, or a fast typist!
• They also suffer from latency, or time to first token.
• For most queries, expect 10–20 second response times, and even with streaming, you’ll end up waiting a few
seconds for the first token to be generated!
Workarounds
• Throw money & hardware at the problem: more GPUs
• Use smaller models
• Generate fewer tokens
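A minimal sketch of the “smaller model, fewer tokens” workarounds using Hugging Face transformers, with streaming so users see output sooner; the model name is only an example:

# Stream tokens as they are generated and bound the output length, so users
# start seeing text immediately instead of waiting for the full completion.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = "gpt2"  # example only; substitute the model you actually serve
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Large language models are slow because", return_tensors="pt")
streamer = TextStreamer(tokenizer, skip_prompt=True)  # print tokens as they arrive

# max_new_tokens caps generation cost; fewer tokens means a faster response.
model.generate(**inputs, streamer=streamer, max_new_tokens=64)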
3. Inference Cost
Concerns
• LLMs are expensive to run!
• Some of the top 180B-parameter models may need as many as 5× A100 GPUs to run, while even quantized
versions of 70B LLAMA take up a whole GPU! And that’s one query at a time.
• The costs add up. For example, a dedicated A100 might cost as much as $20K a month with a cloud provider! A
brute force approach is going to be expensive.
Workarounds
• Use a quantized model - it trades off output quality for performance. 8-bit, 6-bit or even 4-bit quantization will
help you fit models into smaller, cheaper GPU vRAM, or use fewer GPUs.
• Use a smaller model: a quality, fine-tuned 13B model may perform well enough for tasks such as summarization.
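A minimal sketch of the quantization workaround using Hugging Face transformers with bitsandbytes; it assumes a CUDA GPU with the bitsandbytes and accelerate packages installed, and the model name is only an example (it may require access approval on the Hugging Face Hub):

# Load a model in 4-bit to fit it into a smaller / cheaper GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-13b-chat-hf"  # example 13B model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # trades some output quality for memory
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",                  # spread layers across available GPUs
)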
4. Stale training data
Concern
• Even top models haven’t been trained on ‘recent’ data, and have a cut-off date. Remember, a model doesn’t
‘have access to the internet’.
• While certain ‘plugins’ do offer ‘internet search’, it’s just a form of RAG, where ‘top 10 internet search query
results’ are fed into the prompt as context, for example.
Workarounds
• Using a more recent model
• Retraining the model
• Fine tuning
• Retrieval Augmented Generation
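As a sketch of the “internet search is just RAG” point above, the pattern is simply to paste the top results into the prompt as context; search_web() below is a hypothetical placeholder for whatever search API or connector you use:

# 'Internet search as RAG': feed the top search results into the prompt.
def search_web(query: str, top_k: int = 10) -> list[str]:
    """Hypothetical connector returning text snippets for the query."""
    raise NotImplementedError("plug in your search API here")

def prompt_with_search(question: str) -> str:
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Using the search results below, answer the question and say which "
        f"result you used.\n\nSearch results:\n{context}\n\nQuestion: {question}"
    )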
5. Use with private data
LLMs haven’t been trained on your private data, and as such cannot answer questions based on your dataset, unless
that data is injected through fine tuning or some form of prompt engineering, including RAG.
6. Token limits / context window size
Concern
• Models are limited by the TOKEN_LIMIT, and most models can process, at best, a few pages of total input/output.
• This means you can’t just feed a model an entire document and ask for a summary, or ask it to extract facts from
the document.
Workaround
• You need to chunk documents into pages first, and perform multiple queries.
• Use a model with a larger token limit.
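A minimal sketch of the chunk-and-query workaround; the characters-per-token heuristic and the generate() helper are assumptions, so use a real tokenizer and your own model client in practice:

# Split a long document into token-bounded chunks, summarize each chunk, then
# summarize the partial summaries (a simple map-reduce pass).
def generate(prompt: str) -> str:
    """Hypothetical stand-in for whichever LLM client you use."""
    raise NotImplementedError("plug in your model call here")

def chunk_text(text: str, max_tokens: int = 1500) -> list[str]:
    max_chars = max_tokens * 4  # rough heuristic: ~4 characters per token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_long_document(text: str) -> str:
    partials = [generate(f"Summarize:\n\n{chunk}") for chunk in chunk_text(text)]
    return generate("Combine these partial summaries into one:\n\n" + "\n\n".join(partials))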
7. LLMs only support plain text
Concern
• While this sounds obvious (from the name), it also means you can’t just feed a PDF file or Word document to an
LLM. You first need to convert that data to text, and chunk it to fit in the token limit, alongside your prompt and
some room for output.
• Conversion to text isn’t perfect. What happens to your images, tables, or metadata? It also means models can
only output text. Formatting the output as HTML, DOCX or other rich text formats requires a lot of heavy lifting
in your pipeline.
Mitigation
• Having a good data processing pipeline
• Multi-model approaches
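For example, the first stage of such a pipeline might extract plain text from a PDF before chunking it for the model; a sketch using the pypdf package (images and complex tables will not survive this step and need separate handling):

# Extract plain text from a PDF so it can be chunked and fed to the model.
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

text = pdf_to_text("report.pdf")  # example file name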
8. Lack of transparency / explainability
Concern
• Why did the model generate a particular answer? LLMs don’t explain their reasoning, and an answer may not
necessarily be correct.
• You can, however, display the source content that helped generate that answer.
Mitigation
• Content grounding
• Techniques such as RAG can help, as you are able to point at the ‘context’ that generated a particular answer,
and even display the context.
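A minimal sketch of surfacing the grounding context together with the answer; retrieve and generate are hypothetical stand-ins for your retriever and model client:

# Return the retrieved chunks alongside the answer so users can inspect the
# sources that grounded the response.
def answer_with_sources(question: str, retrieve, generate) -> dict:
    sources = retrieve(question)  # chunks used as grounding context
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    answer = generate(
        "Answer using only the numbered context below and cite it like [1].\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )
    return {"answer": answer, "sources": sources}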
9. Ethical Concerns
Concerns
Potential bias, hate, abuse, harm and other ethical concerns: sometimes, answers generated by an LLM can be outright
harmful. Using the RAG pattern, in addition to HARM filters, can help mitigate some of these issues.
Mitigation
• Using open source models with known data lineage
• HARM filters
• Governance frameworks
• Content grounding
• Reinforcement learning from human feedback (RLHF)
10. Training and fine tuning costs
Concern
The “Training Hardware & Carbon Footprint” section of the LLAMA2 paper reports that a total of 3,311,616 GPU hours
was used to train the LLAMA2 family (7B, 13B, 34B and 70B)!
To put it in perspective, a 70B model like LLAMA2 might need ~2048 A100 GPUs for a month to train, adding up to a
$20–40M training cost, not to mention what it takes to download and store the data.
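As a rough sanity check on these figures, the per-hour rate below is an assumed, illustrative cloud price rather than a quoted one:

# Back-of-the-envelope compute cost for the published GPU-hour total.
total_gpu_hours = 3_311_616          # all Llama 2 sizes combined, per the paper
for rate_usd in (2, 4):              # assumed $/A100-hour; actual rates vary widely
    print(f"${rate_usd}/GPU-hour -> ${total_gpu_hours * rate_usd / 1e6:.1f}M for compute alone")
# End-to-end programme cost (experiments, failed runs, data collection, storage,
# staff) is substantially higher than the raw compute figure.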
Workaround
• Don’t train your own model: use a pre-trained model
• Open Source and Open Innovation: share learnings and training data, rather than having proprietary models.
Contact
This talk can be found on GitHub
• https://github.com/crivetimihai/overcome-llm-limitations
Social media
• https://twitter.com/CrivetiMihai - follow for more LLM content
• https://youtube.com/CrivetiMihai - more LLM videos to follow
• https://www.linkedin.com/in/crivetimihai/
  • 28. Contact This talk can be found on GitHub • https://github.com/crivetimihai/overcome-llm-limitations Social media • https://twitter.com/CrivetiMihai - follow for more LLM content • https://youtube.com/CrivetiMihai - more LLM videos to follow • https://www.linkedin.com/in/crivetimihai/ 18