Large Language Models Are Reasoning Teachers

Large Language Models Are
Reasoning Teachers
Namgyu Ho Laura Schmid Se-Young Yun
KAIST AI
🧑🏫

Short Summary
§ Chain-of-thought (CoT) reasoning [Wei 2022] enables complex reasoning
… in huge models with over 100B 🤯 parameters,
… which need over 400GB of VRAM 💰.
§ We use GPT-3 175B as a reasoning teacher 🧑🏫
to teach smaller students with 70M‒6.7B parameters.
§ Diverse reasoning ✨ is a simple way to boost teaching.
§ Extensive analysis 🕵 on the emergence of reasoning.

Introduction
§ Background: chain-of-thought (CoT) prompting [Wei 2022] elicits step-by-step
solutions to complex reasoning tasks.
§ Standard prompting is insufficient.

§ Limitation: CoT prompting is only applicable to very large models such as
GPT-3 175B and PaLM.
§ Solution: apply CoT prompting on very large models to generate training data
on complex reasoning for smaller models.

Method: Fine-tune-CoT
Original Sample
Question: A pet store had 56 puppies. In one day they sold 24 of them and put the rest into cages with 4 in each cage. How many cages did they use?
Answer: 8
Step 1. Reasoning Generation (Large 175B Teacher Model)
Prompt (Zero-shot-CoT)
Q: A pet store had 56 puppies. In one day they sold 24 of them and put the rest into cages with 4 in each cage. How many cages did they use?
A: Let’s think step by step.
Completion (Generated)
The store started with 56 puppies. 24 of them were sold, so that means that there are now 32 puppies left. Since there are 4 puppies in each cage, that means that the store now has 8 cages.
Diverse Reasoning: multiple completions can be generated for the same question.
Step 2. Curation
Reasoning Sample (Curated)
Prompt
A pet store had 56 puppies. In one day … cages did they use? ###
Completion
… are now 32 puppies left. Since there are 4 puppies in each cage, that means that the store now has 8 cages. --> 8 END
Step 3. Fine-tuning (Small Student Model)
The curated reasoning samples form the dataset used to fine-tune the student.
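Steps 1 and 2 above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. It assumes that curation keeps a teacher completion only when its final answer matches the label, and that the answer can be read off as the last number in the completion. The names `Sample`, `zero_shot_cot_prompt`, `last_number`, and `curate` are hypothetical. The `###` and `--> … END` delimiters are taken from the curated sample shown on the slide.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    question: str
    answer: str


def zero_shot_cot_prompt(question: str) -> str:
    """Step 1: question followed by the Zero-shot-CoT trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."


def last_number(text: str) -> Optional[str]:
    """Assumed answer extraction: the last number that appears in the text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def curate(sample: Sample, rationale: str) -> Optional[dict]:
    """Step 2: keep the rationale only if its final answer matches the label
    (an assumed filter), then format it as a prompt/completion pair."""
    predicted = last_number(rationale)
    if predicted != sample.answer:
        return None
    return {
        "prompt": f"{sample.question} ###",
        "completion": f"{rationale.strip()} --> {predicted} END",
    }


sample = Sample(
    question=("A pet store had 56 puppies. In one day they sold 24 of them "
              "and put the rest into cages with 4 in each cage. "
              "How many cages did they use?"),
    answer="8",
)
rationale = ("The store started with 56 puppies. 24 of them were sold, so that "
             "means that there are now 32 puppies left. Since there are 4 "
             "puppies in each cage, that means that the store now has 8 cages.")
curated = curate(sample, rationale)
```

In a real pipeline, `rationale` would come from the teacher model's response to `zero_shot_cot_prompt(sample.question)`; the teacher call is left out here because its API is not specified in the slides.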


Results
§ Fine-tune-CoT enables significant reasoning capabilities in small models.
§ Diverse reasoning boosts performance substantially.

Results
§ Performance Scalability
1. Diverse reasoning
2. Dataset size
3. Teacher performance
4. Student model scale
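The first scaling axis, diverse reasoning, can be illustrated with a short sketch: query the teacher several times per question and keep each distinct rationale that reaches the correct answer. This is an assumption-laden illustration rather than the authors' implementation. The `degree` parameter, the de-duplication step, the last-number answer check, and the canned `make_toy_teacher` stand-in for a stochastic teacher model are all hypothetical.

```python
import itertools
import re
from typing import Callable, List, Optional


def final_number(text: str) -> Optional[str]:
    """Assumed answer extraction: the last number that appears in the text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def diverse_reasoning(question: str, answer: str,
                      teacher: Callable[[str], str], degree: int) -> List[str]:
    """Query the teacher `degree` times and keep distinct correct rationales."""
    kept: List[str] = []
    for _ in range(degree):
        rationale = teacher(question)
        if final_number(rationale) == answer and rationale not in kept:
            kept.append(rationale)
    return kept


def make_toy_teacher() -> Callable[[str], str]:
    """Stand-in for a sampled teacher: cycles through canned completions."""
    canned = itertools.cycle([
        "56 - 24 = 32 puppies are left. 32 / 4 = 8 cages.",
        "After selling 24 of 56 puppies, 32 remain. With 4 per cage, that is 8 cages.",
        "56 / 4 = 14 cages.",  # wrong answer, filtered out
    ])
    return lambda question: next(canned)


rationales = diverse_reasoning(
    "How many cages did they use?", "8", make_toy_teacher(), degree=4
)
```

With the toy teacher, four queries yield two usable rationales for one question: the wrong completion is dropped and the repeated one is not counted twice.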

Results
§ Fine-tune-CoT enables significant reasoning capabilities in small models.
§ Performance is highly scalable under Fine-tune-CoT.
§ Tradeoffs must be considered between:
  § Development-time cost: diverse reasoning, dataset size, teacher model
  § Inference-time cost: student model
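The development-time side of this tradeoff can be made concrete with a back-of-the-envelope sketch. It assumes the teacher is queried once per question per diverse-reasoning sample and is billed per token. The function names and every input (`tokens_per_query`, `price_per_1k_tokens`) are hypothetical placeholders to be filled in with real values, not figures from the slides.

```python
def teacher_queries(num_questions: int, degree: int) -> int:
    """Teacher calls needed: one per question per diverse-reasoning sample."""
    return num_questions * degree


def generation_cost(num_questions: int, degree: int,
                    tokens_per_query: float, price_per_1k_tokens: float) -> float:
    """Rough data-generation cost under an assumed per-token pricing model."""
    total_tokens = teacher_queries(num_questions, degree) * tokens_per_query
    return total_tokens / 1000 * price_per_1k_tokens


# Placeholder inputs for illustration only.
queries = teacher_queries(num_questions=1000, degree=8)
cost = generation_cost(num_questions=1000, degree=8,
                       tokens_per_query=100, price_per_1k_tokens=0.02)
```

Under this model the generation cost grows linearly with both dataset size and the degree of diverse reasoning, while the inference-time cost depends only on the chosen student model.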

(Analysis & Discussion)
§ Cost analysis of data acquisition
§ How should teacher reasoning samples be filtered? Do we need to filter at all?
§ Emergence of reasoning in small language models
§ Distillation of emergent abilities
§ Connection with knowledge distillation

Takeaways
§ Simple distillation can transfer 🧚 reasoning abilities from very large teachers
to small students (<1B parameters) for a single domain.
§ What about other emergent abilities?
§ Fine-tune-CoT with diverse reasoning is an accessible, effective, and highly
scalable approach.
§ Distillation poses a tradeoff between development costs and inference
cost/quality.

Large Language Models Are
Reasoning Teachers
Namgyu Ho Laura Schmid Se-Young Yun
KAIST AI
🧑🏫
Paper
§ Why does reasoning emerge in small models?
§ Results on GPT-2, T5
Code
§ All code and data
§ $1000+ worth of teacher data
with ❤ from OSI LAB @ KAIST.
