Large Language Models Are
Reasoning Teachers
Namgyu Ho Laura Schmid Se-Young Yun
KAIST AI
🧑🏫
Short Summary
§ Chain-of-thought (CoT) reasoning [Wei 2022] enables complex reasoning
… in huge models with over 100B 🤯 parameters (over 400GB of VRAM 💰).
§ We use GPT-3 175B as a reasoning teacher 🧑🏫
to teach smaller students with 70M‒6.7B parameters.
§ Diverse reasoning ✨ is a simple way to boost teaching.
§ Extensive analysis 🕵 on the emergence of reasoning.
Introduction
§ Background: chain-of-thought (CoT) prompting [Wei 2022] elicits models to
solve complex reasoning tasks step-by-step.
§ Standard prompting is insufficient (the two prompt formats are contrasted below).
§ Limitation: CoT prompting is only applicable to very large models such as
GPT-3 175B and PaLM.
§ Solution: apply CoT prompting on very large models to generate training data
on complex reasoning for smaller models.
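To make the contrast concrete, here is a minimal sketch of the two prompt formats. The question text reuses the worked example from the Method figure below, and "Let's think step by step." is the Zero-shot-CoT reasoning trigger used there; the snippet is illustrative only.

```python
# Illustrative only: the question is the worked example from the Method figure,
# and "Let's think step by step." is the Zero-shot-CoT reasoning trigger.
question = (
    "A pet store had 56 puppies. In one day they sold 24 of them and put the "
    "rest into cages with 4 in each cage. How many cages did they use?"
)

# Standard (answer-only) prompting: ask for the answer directly.
standard_prompt = f"Q: {question}\nA:"

# Zero-shot-CoT prompting: the appended trigger elicits a step-by-step rationale
# before the answer, but this works reliably only in very large models.
zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."
```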
Method: Fine-tune-CoT

Step 1. Reasoning Generation: prompt the large 175B teacher model with Zero-shot-CoT on each original training sample.
Original sample: Question: "A pet store had 56 puppies. In one day they sold 24 of them and put the rest into cages with 4 in each cage. How many cages did they use?" Answer: 8.
Prompt (Zero-shot-CoT): "Q: A pet store had 56 puppies. In one day they sold 24 of them and put the rest into cages with 4 in each cage. How many cages did they use? A: Let's think step by step."
Completion (generated by the teacher): "The store started with 56 puppies. 24 of them were sold, so that means that there are now 32 puppies left. Since there are 4 puppies in each cage, that means that the store now has 8 cages."

Step 2. Curation: keep generated reasoning whose final answer matches the original sample's answer, and reformat it into a reasoning sample for fine-tuning.
Prompt: "A pet store had 56 puppies. In one day they sold 24 of them and put the rest into cages with 4 in each cage. How many cages did they use? ###"
Completion: "The store started with 56 puppies. 24 of them were sold, so that means that there are now 32 puppies left. Since there are 4 puppies in each cage, that means that the store now has 8 cages. --> 8 END"

Step 3. Fine-tuning: fine-tune the small student model on the dataset of curated reasoning samples.

Diverse Reasoning ✨: sample multiple reasoning paths per original sample from the teacher to enlarge the fine-tuning dataset.
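Putting the three steps together, here is a minimal sketch of the data side of the pipeline, assuming the legacy (pre-1.0) OpenAI Python client as the teacher interface. The model name, sampling hyperparameters, and helpers such as extract_answer are illustrative assumptions, not the authors' exact implementation; the "###" and "END" delimiters follow the curated-sample format shown above.

```python
import json
import re
import openai  # legacy (pre-1.0) client; openai.api_key must be set

REASONING_TRIGGER = "Let's think step by step."

def extract_answer(completion_text: str) -> str:
    """Illustrative answer extraction: take the last number in the rationale."""
    numbers = re.findall(r"-?\d+\.?\d*", completion_text)
    return numbers[-1] if numbers else ""

def generate_diverse_reasoning(question: str, n_samples: int = 8) -> list[str]:
    """Step 1: sample multiple Zero-shot-CoT rationales from the large teacher.
    Temperature > 0 with n > 1 is what the deck calls 'diverse reasoning'."""
    prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    response = openai.Completion.create(
        model="text-davinci-002",   # stand-in for the GPT-3 175B teacher
        prompt=prompt,
        max_tokens=256,
        temperature=0.7,            # illustrative value
        n=n_samples,
    )
    return [choice.text.strip() for choice in response.choices]

def curate(question: str, rationales: list[str], gold_answer: str) -> list[dict]:
    """Step 2: keep rationales whose final answer matches the gold answer and
    reformat them into prompt-completion pairs for fine-tuning."""
    samples = []
    for rationale in rationales:
        if extract_answer(rationale) == gold_answer:
            samples.append({
                "prompt": f"{question} ###",
                "completion": f" {rationale} --> {gold_answer} END",
            })
    return samples

def build_finetuning_dataset(dataset: list[dict], path: str) -> None:
    """Step 3 (data side): write curated samples as JSONL, the format expected
    by a fine-tuning endpoint for the small student models."""
    with open(path, "w") as f:
        for example in dataset:  # each example: {"question": ..., "answer": ...}
            samples = curate(
                example["question"],
                generate_diverse_reasoning(example["question"]),
                example["answer"],
            )
            for s in samples:
                f.write(json.dumps(s) + "\n")
```

In Step 3 the small student is then fine-tuned on this JSONL file; at inference time it is prompted with the question followed by the "###" delimiter and generates a rationale terminated by "--> <answer> END", from which the final answer is read off.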
Results
§ Fine-tune-CoT enables significant reasoning capabilities in small models.
§ Diverse reasoning boosts performance substantially.
§ Performance is highly scalable under Fine-tune-CoT, along four axes:
1. Diverse reasoning
2. Dataset size
3. Teacher performance
4. Student model scale
§ Tradeoffs must be considered between
§ Development-time cost: diverse reasoning, dataset size, teacher model
§ Inference-time cost: student model
(Analysis & Discussion)
§ Cost analysis of data acquisition
§ How to filter teacher reasoning samples. Do we need to?
§ Emergence of reasoning in small language models
§ Distillation of emergent abilities
§ Connection with knowledge distillation
Takeaways
§ Simple distillation can transfer 🧚 reasoning abilities from very large teachers
to small students <1B for a single domain.
§ What about other emergent abilities?
§ Fine-tune-CoT with diverse reasoning is an accessible and effective approach
which is highly scalable.
§ Distillation poses a tradeoff between development costs and inference
cost/quality.
Large Language Models Are
Reasoning Teachers
Namgyu Ho Laura Schmid Se-Young Yun
KAIST AI
🧑🏫
Paper
§ Why does reasoning emerge in small models?
§ Results on GPT-2, T5
Code
§ All code and data
§ $1000+ worth of teacher data
with ❤ from OSI LAB @ KAIST.
