A non-technical overview of Large Language Models, exploring their potential, limitations, and customization for specific challenges. While this deck was created with a financial-industry audience in mind, its content remains broadly applicable.
(This updated version builds on our previous deck: slideshare.net/LoicMerckel/intro-to-llms.)
* "Responsible AI Leadership: A Global Summit on Generative AI"
* April 2023 guide for experts and policymakers on developing and governing generative AI systems
* Over 100 thought leaders and practitioners participated
* Recommendations for responsible development, open innovation, and social progress
* 30 action-oriented recommendations to help navigate the complexities of AI
Generative AI: Past, Present, and Future – A Practitioner's Perspective – Huahai Yang
As the academic realm grapples with the profound implications of generative AI and related applications like ChatGPT, I will present a grounded view from my experience as a practitioner. Starting with the origins of neural networks in the fields of logic, psychology, and computer science, I trace their history and situate them within the wider context of the pursuit of artificial intelligence. This perspective will also draw parallels with historical developments in psychology. Against this backdrop, I chart a proposed trajectory for the future. Finally, I provide actionable insights for both academics and enterprising individuals in the field.
And then there were ... Large Language Models – Leon Dohmen
It is not often, even in the ICT world, that one witnesses a revolution. The rise of the Personal Computer, the rise of mobile telephony and, of course, the rise of the Internet are some of those revolutions. So what is ChatGPT really? Is ChatGPT also such a revolution? And like any revolution, does ChatGPT have its winners and losers? And who are they? How do we ensure that ChatGPT contributes to a positive impulse for "Smart Humanity"?
During keynotes on April 3 and 13, 2023, Piek Vossen explained the impact of Large Language Models like ChatGPT.
Prof. Piek Th.J.M. Vossen is full professor of Computational Lexicology at the Faculty of Humanities, Department of Language, Literature and Communication (LCC) at VU Amsterdam:
What is ChatGPT? What technology and thought processes underlie it? What are its consequences? What choices are being made? In the presentation, Piek elaborates on the basic principles behind Large Language Models and how they are used as a basis for deep learning systems that are fine-tuned for specific tasks. He also discusses GPT, the specific variant that underlies ChatGPT, covering what ChatGPT can and cannot do, what it is good for, and what the risks are.
An Introduction to Generative AI - May 18, 2023 – CoriFaklaris1
For this plenary talk at the Charlotte AI Institute for Smarter Learning, Dr. Cori Faklaris introduces her fellow college educators to the exciting world of generative AI tools. She gives a high-level overview of the generative AI landscape and how these tools use machine learning algorithms to generate creative content such as music, art, and text. She then shares some examples of generative AI tools and demonstrates how she has used some of these tools to enhance teaching and learning in the classroom and to boost her productivity in other areas of academic life.
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in... – David Talby
An April 2023 presentation to the AMIA working group on natural language processing. The talk focuses on three current trends in NLP and how they apply in healthcare: Large language models, No-code, and Responsible AI.
This talk overviews my background as a female data scientist, introduces many types of generative AI, discusses potential use cases, highlights the need for representation in generative AI, and showcases a few tools that currently exist.
The Future of AI is Generative not Discriminative 5/26/2021 – Steve Omohundro
The deep learning AI revolution has been sweeping the world for a decade now. Deep neural nets are routinely used for tasks like translation, fraud detection, and image classification. PwC estimates that they will create $15.7 trillion/year of value by 2030. But most current networks are "discriminative" in that they directly map inputs to predictions. This type of model requires lots of training examples, doesn't generalize well outside of its training set, creates inscrutable representations, is subject to adversarial examples, and makes knowledge transfer difficult. People, in contrast, can learn from just a few examples, generalize far beyond their experience, and can easily transfer and reuse knowledge. In recent years, new kinds of "generative" AI models have begun to exhibit these desirable human characteristics. They represent the causal generative processes by which the data is created and can be compositional, compact, and directly interpretable. Generative AI systems that assist people can model their needs and desires and interact with empathy. Their adaptability to changing circumstances will likely be required by rapidly changing AI-driven business and social systems. Generative AI will be the engine of future AI innovation.
Unlocking the Power of Generative AI An Executive's Guide.pdf – PremNaraindas1
Generative AI is here, and it can revolutionize your business. With its powerful capabilities, this technology can help companies create more efficient processes, unlock new insights from data, and drive innovation. But how do you make the most of these opportunities?
This guide will provide you with the information and resources needed to understand the ins and outs of Generative AI, so you can make informed decisions and capitalize on the potential. It covers important topics such as strategies for leveraging large language models, optimizing MLOps processes, and best practices for building with Generative AI.
For many decades now, the software industry has attempted to bridge the productivity gap, develop higher-quality code, and manage the ever-growing complexity of software-intensive systems. The results have been mixed, and as a result, a great majority of today's software is still written manually by human developers. This is about to change rapidly as recent developments in the field of Artificial Intelligence show promising results. While artists and designers have been taken by surprise by OpenAI’s DALL-E 2’s capabilities in designing unique art, ChatGPT has astonished the rest of the world with its capability of understanding human interaction. AI-assisted coding solutions such as GitHub’s Copilot and Replit’s Ghostwriter, among many others, are rapidly developing in a direction where AI generates new code that runs fast with high quality. Little is known about the true capabilities of AI programmers and their impact on the software development industry, education, and research. This talk sheds light on the current state of ChatGPT and large language models including GPT-4, covers AI-assisted coding, highlights the research gaps, and proposes a way forward.
🔹How will AI-based content-generating tools change your mission and products?
🔹This complimentary webinar [ON-DEMAND] explores multiple use cases that drive adoption in their early adopter customer base to provide product leaders with insights into the future of generative AI-powered businesses, and the potential generative AI holds for driving innovation and improving business processes.
This session was presented at the AWS Community Day in Munich (September 2023). It's for builders that heard the buzz about Generative AI but can’t quite grok it yet. Useful if you are eager to connect the dots on the Generative AI terminology and get a fast start for you to explore further and navigate the space. This session is largely product agnostic and meant to give you the fundamentals to get started.
In this session, you'll get all the answers about how ChatGPT and other GPT-X models can be applied to your current or future project. First, we'll put in order all the terms – OpenAI, GPT-3, ChatGPT, Codex, Dall-E, etc. – and explain why Microsoft and Azure are often mentioned in this context. Then, we'll go through the main capabilities of Azure OpenAI and respective use cases that might inspire you to either optimize your product or build a completely new one.
This presentation presents an overview of the challenges and opportunities of generative artificial intelligence in Web3. It includes a brief research history of generative AI as well as some of its immediate applications in Web3.
Exploring Opportunities in the Generative AI Value Chain.pdf – Dung Hoang
The article "Exploring Opportunities in the Generative AI Value Chain" by McKinsey & Company's QuantumBlack provides insights into the value created by generative artificial intelligence (AI) and its potential applications.
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s... – Mihai Criveti
Mihai is the Principal Architect for Platform Engineering and Technology Solutions at IBM, responsible for Cloud Native and AI Solutions. He is a Red Hat Certified Architect, CKA/CKS, a leader in the IBM Open Innovation community, and an advocate for open source development. Mihai is driving the development of Retrieval Augmented Generation platforms and solutions for Generative AI at IBM that leverage WatsonX, vector databases, LangChain, HuggingFace, and open source AI models.
Mihai will share lessons learned building Retrieval Augmented Generation, or “Chat with Documents”, platforms and APIs that scale and deploy on Kubernetes. His talk will cover use cases for Generative AI, limitations of Large Language Models, and the use of RAG, vector databases, and fine tuning to overcome model limitations and build solutions that connect to your data, provide content grounding, limit hallucinations, and form the basis of explainable AI. In terms of technology, he will cover LLAMA2, HuggingFace TGIS, SentenceTransformers embedding models using Python, LangChain, and the Weaviate and ChromaDB vector databases. He’ll also share tips on writing code with LLMs, including building an agent for Ansible and containers.
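The retrieval step at the heart of such a "Chat with Documents" pipeline can be sketched in a few lines. The following is a toy illustration only: it uses bag-of-words vectors and an in-memory document list in place of the SentenceTransformers embeddings and Weaviate/ChromaDB vector stores mentioned above, and all function names are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Ground the model by injecting retrieved context into the prompt."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

A production system would replace `embed` with a real embedding model, store vectors in a database with sharding and high availability (see the scaling factors below), and send `build_prompt`'s output to an LLM.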
Scaling factors for Large Language Model Architectures:
• Vector Database: consider sharding and High Availability
• Fine Tuning: collecting data to be used for fine tuning
• Governance and Model Benchmarking: how are you testing your model performance over time, with different prompts, one-shot, and various parameters
• Chain of Reasoning and Agents
• Caching embeddings and responses
• Personalization and Conversational Memory Database
• Streaming Responses and optimizing performance. A fine-tuned 13B model may perform better than a poor 70B one!
• Calling 3rd-party functions or APIs for reasoning or other types of data (e.g., LLMs are terrible at reasoning and prediction; consider calling other models)
• Fallback techniques: fall back to a different model, or to default answers
• API scaling techniques, rate limiting, etc.
• Async, streaming and parallelization, multiprocessing, GPU acceleration (including embeddings), generating your API using OpenAPI, etc.
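Several of the bullets above (caching responses, model fallback, preferring a well-tuned small model) can be combined in a single serving path. Below is a minimal Python sketch; the two `call_*_model` functions are hypothetical stubs standing in for real inference APIs, and the dictionary stands in for an external cache:

```python
import hashlib
from typing import Optional

def call_small_model(prompt: str) -> Optional[str]:
    """Stub for a fast, fine-tuned 13B model that cannot handle every prompt."""
    return None if "hard" in prompt else f"small-model answer to: {prompt}"

def call_large_model(prompt: str) -> str:
    """Stub for a slower, more capable fallback model."""
    return f"large-model answer to: {prompt}"

_response_cache: dict = {}

def answer(prompt: str) -> str:
    """Serve cached responses; try the cheap model first, fall back on failure."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _response_cache:
        return _response_cache[key]
    result = call_small_model(prompt)
    if result is None:  # fallback technique: route to a stronger model
        result = call_large_model(prompt)
    _response_cache[key] = result
    return result
```

The same routing shape extends naturally to the other bullets: a default answer as a final fallback, rate limiting in front of `answer`, and async calls when both models are tried in parallel.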
OpenAI’s GPT 3 Language Model - guest Steve Omohundro – Numenta
In this research meeting, guest Stephen Omohundro gave a fascinating talk on GPT-3, the new massive OpenAI Natural Language Processing model. He reviewed the network architecture, training process, and results in the context of past work. There was extensive discussion on the implications for NLP and for Machine Intelligence / AGI.
Link to GPT-3 paper: https://arxiv.org/abs/2005.14165
Link to YouTube recording of Steve's talk: https://youtu.be/0ZVOmBp29E0
‘Big models’: the success and pitfalls of Transformer models in natural langu... – Leiden University
Abstract: Large Language Models receive a lot of attention in the media these days. We have all experienced that generative language models of the GPT family are very fluent and can convincingly answer complex questions. But they also have their limitations and pitfalls. In this presentation I will introduce Transformer-based language models, explain the relation between BERT, GPT, and the 130 thousand other models available on https://huggingface.co. I will discuss their use and applications and why they are so powerful. Then I will point out challenges and pitfalls of Large Language Models and the consequences for our daily work and education.
https://bigscience.huggingface.co/
EN: Presentation of the BigScience project: a research initiative launched by HuggingFace and aiming to build a large language model (inspired by OpenAI and GPTx) over multiple languages and a very large processing cluster. The participants plan to investigate the dataset and the model from all angles: bias, social impact, capabilities, limitations, ethics, potential improvements, specific domain performances, carbon impact, general AI/cognitive research landscape.
FR: Presentation of the BigScience project: an open research project launched by HuggingFace that aims to build a language model (i.e., somewhat like OpenAI and GPT-3) while exploring the issues around the dataset and the model from the angles of cognitive bias, social and environmental impact, ethical limits, possible performance gains, and the broader impact of this type of approach when the goal is not just "to have a bigger model".
Gives a background of Data Science and Artificial Intelligence to better understand the current state of the art (SOTA) for Large Language Models (LLMs) and Generative AI, then starts a discussion on the direction things are going in the future.
What does Generative AI mean for public policy? – Sam Gilbert
The instant popularity of AI tools like ChatGPT and Stable Diffusion has seen so-called "Generative AI" supplant the metaverse as the hottest trend in tech.
But is the technology really significant, or is it mostly hype?
Focusing on OpenAI's large language models (LLMs), this presentation by Sam Gilbert to the University of Cambridge's Bennett Institute explores the potential public policy implications of Generative AI -- and how policymakers and policy researchers can use it in their own work.
Past, Present and Future of Generative AI – abhishek36461
Generative AI creates new content (images, text, music) based on learned patterns.
It learns from vast examples and can produce original, unseen works.
Capable of blending learned elements to generate unique outputs.
Can produce customized creations based on specific prompts.
Improves and refines its output over time with more data and feedback.
This paper analyses ChatGPT 3.5 Turbo, the latest free version of the ChatGPT service. By using ChatGPT to create theoretical papers on itself, further insights are gained into the potential of ChatGPT and other generative artificial intelligence services, as well as their weaknesses. Through the analysis presented, potential future avenues of research are suggested, alongside the changes necessary if services like ChatGPT are to be used as tools in academic work.
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Produced by Nathan Benaich and the Air Street Capital team.
What Is GPT-3 And Why Is It Revolutionizing Artificial Intelligence? – Bernard Marr
Could GPT-3 be the most powerful artificial intelligence ever developed? When OpenAI, a research business co-founded by Elon Musk, released the tool recently, it created a massive amount of hype. Here we look through the hype and outline what it is and what it isn’t.
Benefiting from Semantic AI along the data life cycleMartin Kaltenböck
Slides of 1 hour session of Martin Kaltenböck (CFO and Managing Partner of Semantic Web Company / PoolParty Software Ltd) on 19 March 2019 in Boston, US at the Enterprise Data World 2019, with its title: Benefiting from Semantic AI along the data life cycle.
State of AI Report 2023 - ONLINE presentation – ssuser2750ef
When conducting a PEST analysis for the Syrian conflict, it's important to consider the political, economic, socio-cultural, and technological factors that have influenced and continue to impact the situation in Syria. Here's a high-level overview of a PEST analysis for the Syrian conflict:
1. Political Factors:
- Government Instability: Ongoing civil war and conflict have led to political instability and a complex power struggle between various factions and international players.
- Foreign Intervention: Involvement of external powers and regional actors has exacerbated the conflict and added geopolitical complexities to the situation.
- International Relations: Relations with global powers like the United States, Russia, and regional players like Iran and Turkey significantly impact the conflict dynamics.
2. Economic Factors:
- Humanitarian Crisis: The conflict has resulted in a severe humanitarian crisis, causing widespread displacement, destruction of infrastructure, and economic decline.
- Sanctions and Trade Barriers: International sanctions and disrupted trade have further worsened the economic situation in Syria, affecting the livelihoods of the population.
- Resource Depletion: Conflict-driven resource depletion, including loss of agricultural lands and disruption of industries, has weakened the economy.
3. Socio-cultural Factors:
- Civilian Suffering: The conflict has led to a significant loss of life, displacement of populations, and severe trauma among civilians, impacting social cohesion and community structures.
- Ethnic and Religious Divisions: Deep-seated ethnic and religious divisions have fueled the conflict, leading to sectarian tensions and societal fragmentation.
- Refugee Crisis: The conflict has triggered a massive refugee crisis, with millions of Syrians seeking asylum in neighboring countries and beyond, straining regional stability.
4. Technological Factors:
- Communication and Propaganda: Technology, including social media, has been used for communication, mobilization, and spreading propaganda by various actors in the conflict.
- Warfare Technology: Advancements in warfare technology and the use of drones, cyber warfare, and other advanced weaponry have transformed the nature of conflict in Syria.
- Cybersecurity Concerns: The conflict has also raised concerns about cybersecurity threats, misinformation campaigns, and digital vulnerabilities in the region.
This analysis provides a broad understanding of the multifaceted nature of the Syrian conflict, highlighting the diverse factors at play and the complex challenges facing Syria and the international community.
Genetic Algorithms and Programming - An Evolutionary Methodologyacijjournal
Genetic programming (GP) is an automated method for creating a working computer program from a high-level problem statement of a problem. Genetic programming starts from a high-level statement of “what needs to be done” and automatically creates a computer program to solve the problem. In artificial intelligence, genetic programming (GP) is an evolutionary algorithm-based methodology inspired by biological evolution to find computer programs that perform a user defined task. It is a specialization of genetic algorithms (GA) where each individual is a computer program. It is a machine learning technique used to optimize a population of computer programs according to a fitness span determined by a program's ability to perform a given computational task. This paper presents a idea of the various principles of genetic programming which includes, relative effectiveness of mutation, crossover, breeding computer programs and fitness test in genetic programming. The literature of traditional genetic algorithms contains related studies, but through GP, it saves time by freeing the human from having to design complex algorithms. Not only designing the algorithms but creating ones that give optimal solutions than traditional counterparts in noteworthy ways.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
2. 1966: ELIZA
Image source: en.wikipedia.org/wiki/ELIZA#/media/File:ELIZA_conversation.png
“While ELIZA was capable of engaging in discourse, it could not converse with true understanding. However, many early users were convinced of ELIZA's intelligence and understanding, despite Weizenbaum's insistence to the contrary.”
Source: en.wikipedia.org/wiki/ELIZA (and references therein).
3. 2005: SCIgen - An Automatic CS Paper Generator
nature.com/articles/d41586-021-01436-7
news.mit.edu/2015/how-three-mit-students-fooled-scientific-journals-0414
A project using a rather rudimentary technology that aimed to "maximize amusement, rather than coherence" is still causing trouble today...
pdos.csail.mit.edu/archive/scigen
4. 2017: Google Revolutionized Text Generation
■ Vaswani (2017), Attention Is All You Need (doi.org/10.48550/arXiv.1706.03762)
■ openai.com/research/better-language-models
Image generated with DALL.E: “A small robot standing on the shoulder of a giant robot” (and slightly modified with The Gimp)
OpenAI’s Generative Pre-trained Transformer (DALL.E, 2021; ChatGPT, 2022), as the name suggests, builds on Transformers.
Google introduced the Transformer, which rapidly became the state-of-the-art approach to solving most NLP problems.
5. Transformers (2017)
● Kiela et al. (2021), Dynabench: Rethinking Benchmarking in NLP: arxiv.org/abs/2104.14337
● Roser (2022), The brief history of artificial intelligence: The world has changed fast – what might be next?: ourworldindata.org/brief-history-of-ai
Text and shapes in blue have been added to the original work from Max Roser.
6. What Are Transformers?
Source: Vaswani (2017), Attention Is All You Need
(doi.org/10.48550/arXiv.1706.03762)
Generative (deep learning) models for understanding and generating text, images and many other types of data.
Transformers analyze chunks of data, called "tokens," and learn to predict the next token in a sequence based on the previous and, if available, following tokens.
The auto-regressive concept means that the output of the model, such as the prediction of a word in a sentence, is influenced by the previous words it has generated.
Music—MusicLM (Google) and Jukebox (OpenAI) generate music from text.
Image—Imagen (Google) and DALL.E (OpenAI) generate novel images from text.
Text—OpenAI’s GPT has become widely known, but other players have similar technology (including Google, Meta, Anthropic and others).
Others—Recommenders (movies, books, flight destinations), drug discovery…
Models that learn from a given dataset how to generate new data instances.
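The auto-regressive loop can be caricatured in a few lines. Here a made-up lookup table stands in for the model's learned next-token distribution (a real Transformer computes these predictions with attention over the whole context; the table and sentence are purely illustrative):

```python
# Hypothetical next-token table standing in for a trained model's
# next-token distribution (illustrative only).
NEXT_TOKEN = {
    "The": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
    "the": "mat",
    "mat": ".",
}

def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive loop: each predicted token is appended to the
    context and fed back in to predict the next one."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = NEXT_TOKEN.get(tokens[-1])
        if nxt is None:   # no prediction available: stop
            break
        tokens.append(nxt)
        if nxt == ".":    # end-of-sequence marker
            break
    return tokens

print(" ".join(generate(["The"])))  # → The cat sat on the mat .
```

The key point is the feedback: the model's own output becomes part of its next input, which is exactly why earlier generated words influence later ones.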
7. 2022: ChatGPT
“ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history”
statista.com/chart/29174/time-to-one-million-users
Reuters, Feb 1, 2023
https://reut.rs/3yQNlGo
8. The Mushrooming of Transformer-Based LLMs
PaLM (540b), LaMDA (137b) and others.
OPT-IML (175b), Galactica (120b), BlenderBot3 (175b), Llama 2 (70b)
ERNIE 3.0 Titan (260b)
GPT-3 (175b), GPT-3.5 (?b), GPT-4 (?b)
BLOOM (176b)
PanGu-𝛼 (200b)
Jurassic-1 (178b), Jurassic-2 (?b)
Exaone (300b)
Megatron-Turing NLG (530b)
Claude (?b), Claude 2 (?b)
(It appears that all those models rely only on transformer-based decoders.)
12. AI Mentions Boost Stock Prices
● AI-mentioning companies: +4.6% avg. stock price increase (nearly double that of non-mentioning companies).
● In general, 67% of companies that mentioned AI observed an increase in their stock prices → +8.5% on average.
● Tech companies: 71% → +11.9% on avg.
● Non-tech companies: 65% → +6.8% on avg.
- Mentions of "AI" and related terms (machine learning, automation, robots, etc.).
- S&P 500 companies in 2023.
- 3-day change from the date the earnings call transcript was published. Source: wallstreetzen.com/blog/ai-mention-moves-stock-prices-2023
13. GPUs Demand Skyrockets
Before LLMs, GPUs were primarily needed for training, and CPUs were used for inference. However, with the emergence of LLMs, GPUs have become almost essential for both tasks.
Paraphrasing Brannin McBee, co-founder of CoreWeave, in a Bloomberg podcast*:
While you may train the model using 10,000 GPUs, the real challenge arises when you need 1 million GPUs to meet the entire inference demand. This surge in demand is expected during the initial one to two years after the launch, and it's likely to keep growing thereafter.
* How to Build the Ultimate GPU Cloud to Power AI | Odd Lots (youtube.com/watch?v=9OOn6u6GIqk&t=1308s)
14. Enhancing Productivity With Generative AI?
nature.com/articles/d41586-023-02270-9
science.org/doi/10.1126/science.adh2586
15. McKinsey & Goldman Are Rather Bullish
mckinsey.com/mgi/overview/in-the-news/ai-could-increase-corporate-profits-by-4-trillion-a-year-according-to-new-research
goldmansachs.com/intelligence/pages/generative-ai-could-raise-global-gdp-by-7-percent.html
16. ● AT&T started to investigate “Mobile Telephony” in 1980.
● McKinsey then projected that the mobile phone market in 2000 would be fewer than 1 million subscribers.
● It turned out to be more than 120 million, and several billion today…
Sources:
● Cutting the cord, economist.com/node/246152
● Statistics, itu.int/en/ITU-D/Statistics/Pages/stat/default.aspx
● Danny Ralph & Marc Jansen, Lecture Slides 2015, Management Science, Judge Business School, The University of Cambridge, UK
Forecasts by Big Names Are Not Always Reliable
18. Beware of “Hallucinations” Which Do Remain Very Real
“Hallucinations” are “confident statements that are not true”1.
For the moment, this phenomenon inexorably affects all known LLMs.
1: fr.wikipedia.org/wiki/Hallucination_(intelligence_artificielle)
Yves Montand in “Le Cercle Rouge” during an attack of delirium tremens
This thing probably doesn't exist.
19. Concrete Hallucinations (GPT-4)
We asked ChatGPT the first part of the third question of the British Mathematical Olympiad 1977: bmos.ukmt.org.uk/home/bmo-1977.pdf
Is that so? Although not an obvious hallucination, it may remind us of Fermat’s lack of space in the margin to give the proof of his last theorem… Perhaps here there is a lack of tokens?
Here is a total hallucination: this statement is evidently false.
Perhaps it meant “the product of two negative numbers.”
Here is a total hallucination: this statement is evidently false. (Although in this case the inequality is indeed clearly true.)
20. The Saga of the Lawyer Who Used ChatGPT
nytimes.com/2023/06/08/nyregion/lawyer-chatgpt-sanctions.html
nytimes.com/2023/05/27/nyregion/avianca-airline-lawsuit-chatgpt.html
nytimes.com/2023/06/22/nyregion/lawyers-chatgpt-schwartz-loduca.html
21. ChatGPT: Achieving Human-Level Performance in Professional and Academic Benchmarks
● GPT-4's performance in recent tests is undeniably impressive.
● Study conducted by OpenAI (openai.com/papers/gpt-4.pdf).
● Most of those tests mainly focus on high school-level content.
● Many are prepared for through test-prep courses and resources.
● By contrast, university exams typically require a deeper understanding of course material and critical thinking skills.
● Uniform Bar Exam: worth noting, but with potential overestimation concerns (see dx.doi.org/10.2139/ssrn.4441311).
22. Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models
Published on Jun 15, 2023
Authors: Sarah J. Zhang, Samuel Florin, Ariel N. Lee, Eamon Niknafs, Andrei Marginean, Annie Wang, Keith Tyser, Zad Chin, Yann Hicke, Nikhil Singh, Madeleine Udell, Yoon Kim, Tonio Buonassisi, Armando Solar-Lezama, Iddo Drori
Abstract
We curate a comprehensive dataset of 4,550 questions and solutions from problem sets, midterm exams, and final exams across all MIT Mathematics and Electrical Engineering and Computer Science (EECS) courses required for obtaining a degree. We evaluate the ability of large language models to fulfill the graduation requirements for any MIT major in Mathematics and EECS. Our results demonstrate that GPT-3.5 successfully solves a third of the entire MIT curriculum, while GPT-4, with prompt engineering, achieves a perfect solve rate on a test set excluding questions based on images. We fine-tune an open-source large language model on this dataset. We employ GPT-4 to automatically grade model responses, providing a detailed performance breakdown by course, question, and answer type. By embedding questions in a low-dimensional space, we explore the relationships between questions, topics, and classes and discover which questions and classes are required for solving other questions and classes through few-shot learning. Our analysis offers valuable insights into course prerequisites and curriculum design, highlighting language models' potential for learning and improving Mathematics and EECS education.
Source: arxiv.org/abs/2306.08997
i.e., GPT-4 scored 100% on the MIT EECS (Electrical Engineering and Computer Science) curriculum
23. “No, GPT4 can’t ace MIT”
Three MIT undergrads have debunked the myth.
- 4% of the questions were unsolvable. (How did GPT-4 achieve 100%?)
- Information leak in some few-shot prompts: for those, the answer was quasi-given in the question.
- The automatic grading using GPT-4 itself has some severe issues: a prompt cascade reprompted (many times) when the given answer was deemed incorrect. 16% of the questions were multiple-choice questions, hence a quasi-guaranteed correct response.
- Bugs found in the research script raise serious questions regarding the soundness of the study.
Source: flower-nutria-41d.notion.site/No-GPT4-can-t-ace-MIT-b27e6796ab5a48368127a98216c76864
Note: The paper has since been withdrawn (see official statement at people.csail.mit.edu/asolar/CoursesPaperStatement.pdf)
24. Chemistry May Not Be ChatGPT's Cup of Tea
A study conducted by three researchers at the University of Hertfordshire (UK) showed that ChatGPT is not a fan of chemistry.
Real exams were used, and the authors note that “[a] well-written question item aims to create intellectual challenge and to require interpretation and inquiry. Questions that cannot be easily ‘Googled’ or easily answered through a single click in an internet search engine is a focus.”
“The overall grade on the year 1 paper calculated from the top four graded answers would be 34.1%, which does not meet the pass criteria. The overall grade on the year 2 paper would be 18.3%, which does not meet the pass criteria.”
Source: Fergus et al., 2023, Evaluating Academic Answers Generated Using ChatGPT (pubs.acs.org/doi/10.1021/acs.jchemed.3c00087)
25. The “Drift” Phenomenon
Sources:
- wsj.com/articles/chatgpt-openai-math-artificial-intelligence-8aba83f0
- Chen et al., 2023, arxiv.org/abs/2307.09009
● New research from Stanford and UC Berkeley highlights a fundamental challenge in AI development: "drift."
● Drift occurs when improving one aspect of complex AI models leads to a decline in performance in other areas.
● ChatGPT has shown deterioration in basic math operations despite advancements in other tasks.
● GPT-4 exhibits reduced responsiveness to chain-of-thought prompting (perhaps intended to mitigate potential misuse with malicious prompts).
The “behavior of the ‘same’ LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs” (Chen et al., 2023).
27. First We Must Have a Problem to Solve…
Source: DeepLearning.AI, licensed under CC BY-SA 2.0
28. Then We Need a Model
Commercial APIs
- Google, OpenAI, Anthropic, Microsoft...
- Privacy concerns may arise.
- No specific hardware requirement.
- Prompt engineering (OpenAI offers prompt fine-tuning).
Use a foundation model (many open-source models are available)
- As is (prompt engineering),
- or fine-tuned (either full or parameter-efficient fine-tuning).
- May require specific hardware/infrastructure for hosting, fine-tuning and inference.
Train a model from scratch
- Requires huge resources (both data and computing power).
- (E.g., BloombergGPT, arxiv.org/abs/2303.17564.)
29. A Plethora of Open-Source Pre-Trained Models
huggingface.co/models
Models should be selected depending on:
● The problem at hand.
● The strength of the model.
● The operating costs (larger models require more resources).
● Other considerations (e.g., license).
30. Prompt Engineering: “Query Crafting”
Improving the output with actions like phrasing queries, specifying styles, providing context, or assigning roles (e.g., 'Act as a mathematics teacher') (Wikipedia, 2023).
Some hints can be found in OpenAI’s “GPT best practices” (OpenAI, 2023).
Chain-of-thought: a popular technique consisting in “guiding [LLMs] to produce a sequence of intermediate steps before giving the final answer” (Wei et al., 2022).
Sources:
- Wei, J. et al., 2022, Emergent Abilities of Large Language Models, arxiv.org/abs/2206.07682
- OpenAI, 2023, platform.openai.com/docs/guides/gpt-best-practices/six-strategies-for-getting-better-results
- Wikipedia, 2023, Prompt Engineering, en.wikipedia.org/wiki/Prompt_engineering
(graph from Wei et al., 2022)
About GSM8K benchmark: arxiv.org/abs/2110.14168
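A chain-of-thought prompt is, concretely, just a prompt whose worked example spells out the intermediate reasoning steps. A minimal sketch, using the classic tennis-ball example from Wei et al.'s chain-of-thought paper (the baker question is our own illustrative input):

```python
def cot_prompt(question):
    # One worked example with explicit intermediate steps, followed by
    # the new question; the demonstrated reasoning style nudges the
    # model to "think step by step" before answering.
    example = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    )
    return example + f"Q: {question}\nA:"

print(cot_prompt("A baker makes 4 batches of 12 cookies. How many cookies in total?"))
```

The prompt ends with a bare "A:" so that the model's completion supplies both the reasoning chain and the final answer.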
31. Prompt Engineering: In-Context Learning (ICL)
In-Context Learning (ICL) consists in providing “a few input-output examples in the model’s context (input) as a preamble before asking the model to perform the task for an unseen inference-time example” (Wei et al., 2022).
It is a kind of “ephemeral supervised learning.”
- Zero-shot prompting or zero-shot learning: no example given (works for the largest LLMs; smaller ones may struggle).
- One-shot prompting: one example provided.
- Few-shot prompting: a few examples (typically 3~6).
⚠ Context window limits (e.g., 4096 tokens).
Tweet: @lufthansa Please find our missing luggage!!
Sentiment: negative

Tweet: Will be on LH to FRA very soon. Cheers!
Sentiment: positive

Tweet: Refused to compensate me for 2 days cancelled flights. Joke of a airline
Sentiment:

LLM output: negative
(Example of an input and output for two-shot prompting.)
Source: Wei, J. et al., 2022, Emergent Abilities of Large Language Models, arxiv.org/abs/2206.07682
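Assembling such a few-shot prompt is plain string formatting: labeled examples followed by the unlabeled query. A sketch using the slide's two-shot sentiment example:

```python
def few_shot_prompt(examples, query):
    """Assemble labeled (text, label) examples plus the unlabeled query
    into a single prompt string to send to the LLM as-is."""
    parts = [f"Tweet: {tweet}\nSentiment: {label}" for tweet, label in examples]
    parts.append(f"Tweet: {query}\nSentiment:")
    return "\n\n".join(parts)

examples = [
    ("@lufthansa Please find our missing luggage!!", "negative"),
    ("Will be on LH to FRA very soon. Cheers!", "positive"),
]

print(few_shot_prompt(
    examples,
    "Refused to compensate me for 2 days cancelled flights. Joke of a airline",
))
```

The prompt deliberately ends with a dangling "Sentiment:" label, so the model's most natural continuation is the answer itself; note that every example spends tokens from the context window.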
32. Fine-Tuning: Introduction
Few-shot learning:
- May not be sufficient for smaller models.
- Consumes tokens from the context window.
Fine-tuning is a supervised learning process that leads to a new model (in contrast with in-context learning, which is “ephemeral”).
Task-specific prompt-completion pair data are required.
Base LLM + task-specific prompt-completion pairs ((prompt_1, completion_1), (prompt_2, completion_2), …, (prompt_n, completion_n)) → Fine-tuned LLM
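Such prompt-completion pairs are commonly serialized as JSONL (one JSON object per line) before being handed to a fine-tuning job. A minimal sketch; the field names and example texts are illustrative, and the exact schema varies by provider:

```python
import json

def to_jsonl(pairs):
    # One JSON object per line: the usual on-disk format for
    # fine-tuning datasets ("prompt"/"completion" keys are illustrative;
    # exact field names depend on the provider or framework).
    return "\n".join(
        json.dumps({"prompt": p, "completion": c}) for p, c in pairs
    )

pairs = [
    ("Summarize: The meeting covered Q3 revenue and hiring plans.",
     "Q3 revenue and hiring were discussed."),
    ("Summarize: The new API adds streaming and batch endpoints.",
     "The API now supports streaming and batching."),
]

print(to_jsonl(pairs))
```

Each line is independent, so datasets can be streamed, split, and appended without parsing the whole file.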
33. Full Fine-Tuning: Updating All Parameters
Fine-tuning very often means “instruction fine-tuning.”
Instruction fine-tuning: each prompt-completion pair includes a specific instruction (summarize this, translate that, classify this tweet, …).
● Fine-tuning on a single task (e.g., summarization) may lead to a phenomenon referred to as “catastrophic forgetting” (arxiv.org/pdf/1911.00202), where the model loses its abilities on other tasks (which may not be a business issue, though).
● Fine-tuning on multiple tasks (e.g., summarization, translation, classification, …) requires a lot more training data. (E.g., see FLAN in Wei et al., 2022.)
Full fine-tuning is extremely resource demanding, even more so for large models.
Source: Wei et al., 2022, Finetuned Language Models Are Zero-Shot Learners. arxiv.org/abs/2109.01652
34. Parameter-Efficient Fine-Tuning (PEFT)
Unlike full fine-tuning, PEFT preserves the vast majority of the weights of the original model.
● Less prone to “catastrophic forgetting” on a single task.
● Often a single GPU is enough.
Three methods:
● Selective—fine-tune only a subset of the initial parameters.
● Reparameterization—reparameterize model weights using a low-rank representation, e.g., LoRA (Hu et al., 2021).
● Additive—add trainable layers or parameters to the model, with two approaches:
- Adapters: add new trainable layers to the architecture of the model.
- Soft prompts: focus on manipulating the input (this is not prompt engineering).
Source:
- coursera.org/learn/generative-ai-with-llms/lecture/rCE9r/parameter-efficient-fine-tuning-peft
- Hu et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models. arxiv.org/abs/2106.09685
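The LoRA idea can be sketched in a few lines of NumPy: the pretrained weight matrix W stays frozen, and only two small low-rank factors are trained, so the effective weight is W + B·A. The dimensions below are toy values, not real model sizes:

```python
import numpy as np

d, k, r = 8, 8, 2               # toy layer dims; rank r << d, k
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))     # frozen pretrained weight (never updated)
A = rng.normal(size=(r, k)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))            # trainable; zero init => no change at start

def forward(x):
    # Effective weight is W + B @ A; only A and B receive gradients.
    return x @ (W + B @ A).T

# Trainable parameter count: r*(k+d) instead of d*k.
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

With B initialized to zero, the adapted model starts out exactly equal to the base model, and the number of trainable parameters scales with the rank r rather than with the full d×k weight matrix.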
35. Fine-Tuning With OpenAI GPT (PEFT)
The OpenAI API offers prompt tuning for gpt-3.5-turbo, but not “yet” for GPT-4.
platform.openai.com/docs/guides/fine-tuning
36. Reinforcement Learning From Human Feedback
LLMs are trained on web data full of irrelevant (unhelpful) matter, or worse, where false (dishonest) and/or harmful information is abundant, e.g.,
● Potentially dangerous false medical advice.
● Valid techniques for illegal activities (hacking, deceiving, building weapons, …).
HHH (Helpful, Honest & Harmless) alignment (Askell et al., 2021): ensuring that the model's behavior and outputs are consistent with human values, intentions, and ethical standards.
Reinforcement Learning from Human Feedback, or RLHF (Casper et al., 2023)
● “is a technique for training AI systems to align with human goals.”
● “[It] has emerged as the central method used to finetune state-of-the-art [LLMs].”
● It rests on human judgment and consensus.
Source:
- Casper et al., 2023, Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arxiv.org/abs/2307.15217
- Ziegler et al., 2022, Fine-Tuning Language Models from Human Preferences. arxiv.org/abs/1909.08593
- Askell et al., 2021, A General Language Assistant as a Laboratory for Alignment. arxiv.org/abs/2112.00861
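The pairwise comparison at the heart of RLHF can be illustrated with a toy Bradley-Terry-style preference loss, as used in the RLHF literature: a reward model scores two candidate answers, and training pushes the human-preferred one to score higher. The scores below are made-up numbers:

```python
import math

def preference_loss(r_preferred, r_rejected):
    # -log sigmoid(r_preferred - r_rejected): near zero when the
    # preferred answer already outscores the rejected one, large when
    # the reward model ranks the pair the wrong way around.
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

print(round(preference_loss(2.0, 0.0), 3))  # preferred wins: low loss
print(round(preference_loss(0.0, 2.0), 3))  # preferred loses: high loss
```

Minimizing this loss over many human-labeled pairs yields a reward model, which is then used as the training signal for the reinforcement-learning step that adjusts the LLM itself.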
37. What Is RLHF by Sam Altman
5:59 What is RLHF? Reinforcement Learning with Human Feedback, …
6:07 … So, we trained these models on a lot of text data and, in that process, they learned the underlying, …. And they can do amazing things.
6:26 But when you first play with that base model, that we call it, after you finish training, … it can do a lot of, you know, there's knowledge in there. But it's not very useful or, at least, it's not easy to use, let's say. And RLHF is how we take some human feedback,
6:45 the simplest version of this is show two outputs, ask which one is better than the other,
6:50 which one the human raters prefer, and then feed that back into the model with reinforcement learning.
6:56 And that process works remarkably well with, in my opinion, remarkably little data to make the model more useful. So, RLHF is how we align the model to what humans want it to do.
Sam Altman: OpenAI CEO on GPT-4, ChatGPT, and the Future of AI | Lex Fridman Podcast #367 (youtu.be/L_Guz73e6fw?si=vfkdtNCyrQa1RzZR&t=359)
38. Source: Liu et al., 2022, Aligning Generative Language Models with Human Values. aclanthology.org/2022.findings-naacl.18
RLHF: Example of Alignment Tasks
39. RLHF Illustration
Source: Lambert et al., 2022, Illustrating Reinforcement Learning from Human Feedback (RLHF). huggingface.co/blog/rlhf
Images copied from huggingface.co/blog/rlhf
41. Assessing and Comparing LLMs
Metrics while training the model—ROUGE (summary) or BLEU (translation).
Benchmarks—A non-exhaustive list:
- ARC (Abstraction and Reasoning Corpus, arxiv.org/pdf/2305.18354),
- HellaSwag (arxiv.org/abs/1905.07830),
- TruthfulQA (arxiv.org/abs/2109.07958),
- GLUE & SuperGLUE (General Language Understanding Evaluation, gluebenchmark.com),
- HELM (Holistic Evaluation of Language Models, crfm.stanford.edu/helm),
- MMLU (Massive Multitask Language Understanding, arxiv.org/abs/2009.03300),
- BIG-bench (arxiv.org/pdf/2206.04615).
Others—“Auto-Eval of Question-Answering Tasks”
(blog.langchain.dev/auto-eval-of-question-answering-tasks).
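To make the training-time metrics concrete, here is a toy ROUGE-1 recall: the fraction of reference unigrams recovered by the candidate summary. Real ROUGE implementations also report precision and F1 and handle stemming and count clipping; this sketch captures only the core idea:

```python
def rouge1_recall(reference, candidate):
    """Toy ROUGE-1 recall: share of reference words present in the
    candidate (simplified; real ROUGE clips repeated-word counts)."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    hits = sum(1 for w in ref_words if w in cand_words)
    return hits / len(ref_words)

score = rouge1_recall("the cat sat on the mat", "the cat lay on the mat")
print(score)  # 5 of 6 reference words recovered
```

BLEU works in the opposite direction (precision of the candidate's n-grams against the reference), which is why ROUGE is favored for summarization and BLEU for translation.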
42. Source: Wu et al., 2023, BloombergGPT: A Large Language Model for Finance. arxiv.org/abs/2303.17564 (Table 13: “BIG-bench hard results using standard 3-shot prompting”)
43. Source: Touvron et al., 2023, Llama 2: Open Foundation and Fine-Tuned Chat Models,
scontent-fra3-1.xx.fbcdn.net/v/t39.2365-6/10000000_662098952474184_2584067087619170692_n.pdf
46. Question ChatGPT About the Latest Financial Reports?
“[ChatGPT] doesn’t know about your private data, it doesn’t know about recent sources of data. Wouldn’t it be useful if it did?”
—blog.langchain.dev/tutorial-chatgpt-over-your-data
47. Workflow Overview
Question: « Quels vont être les dividendes payés par action par le Groupe Crit ? » (“What dividends per share will Groupe Crit pay?”)
Answer: « Le Groupe CRIT proposera lors de sa prochaine Assemblée Générale, le 9 juin 2023, le versement d'un dividende exceptionnel de 3,5 € par action. » (“At its next General Meeting, on June 9, 2023, Groupe CRIT will propose the payment of an exceptional dividend of €3.5 per share.”)
The example (the question and associated answer) is a real one (the LLM was “gpt-3.5-turbo” from OpenAI).
Technique described in: Lewis et al., 2020, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (doi.org/10.48550/arXiv.2005.11401).
Workflow: (1) split the documents into chunks and compute embeddings, stored in a vector store; (2) extract the information relevant to the question (“context”); (3) generate a prompt accordingly (“question + context”) and send it to the LLM.
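The retrieval part of this workflow can be caricatured in a few lines. Here a crude bag-of-words vector stands in for a real embedding model, and the chunks, question, and prompt template are all illustrative, not taken from the actual prototype:

```python
from collections import Counter
import math

def embed(text):
    # Crude "embedding": bag-of-words counts. A real system would call
    # an embedding model here and store dense vectors instead.
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [  # pretend these came from splitting a financial report
    "The board proposes an exceptional dividend of 3.5 euros per share.",
    "Revenue for the first half grew by 12 percent year on year.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]  # toy "vector store"

question = "What dividend per share is proposed?"
q_vec = embed(question)
best = max(store, key=lambda item: cosine(q_vec, item[1]))[0]  # retrieval

prompt = f"Context: {best}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

The retrieved chunk is injected as context ahead of the question, which is how the LLM ends up answering about data it was never trained on.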
48. Preliminary Prototype
Financial reports retrieved directly from the French AMF (“Autorité des marchés financiers”) via its API (info-financiere.fr).
XHTML document in French.
Question and answer are in English (they would be in French should the question be asked in French).
49. Except where otherwise noted, this work is licensed under https://creativecommons.org/licenses/by/4.0/
619.io