Fine-tuning Large Language Models for data-to-text generation
• Lead ML Engineer @ Ecentria Group / Intexsys
• I'm working on a large US e-commerce project
• M.Sc. in Computer Science @ TSI
• 10+ years of experience in Software Engineering
2
Agenda
• LLM fine-tuning introduction
• Data-to-text generation for a specific business case
• Conclusions
• Q&A
3
Introduction to LLM fine-tuning
4
Zero/One/Few-shot learning
• Zero-shot learning
• Means providing a prompt that isn't part of the training data
• Example: asking the model open questions
• One/Few-shot learning
• Provides one or a few examples as part of the prompt
• Example: asking the model to format text and providing a few examples (see the prompting sketch below)
• Prompt engineering
5
Q: What is the title of this section?
A: Introduction to LLM fine-tuning
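A minimal sketch contrasting a zero-shot and a few-shot prompt. The "gpt2" checkpoint and the date-formatting task are illustrative placeholders only; any instruction-following LLM could be substituted.

```python
# Zero-shot vs. few-shot prompting sketch; "gpt2" is only a placeholder model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Zero-shot: only the task description, no examples.
zero_shot = "Convert the date to ISO format: March 5, 2024 ->"

# Few-shot: a couple of worked examples precede the actual query.
few_shot = (
    "Convert the date to ISO format:\n"
    "January 2, 2023 -> 2023-01-02\n"
    "July 14, 2022 -> 2022-07-14\n"
    "March 5, 2024 ->"
)

for prompt in (zero_shot, few_shot):
    result = generator(prompt, max_new_tokens=12, do_sample=False)
    print(result[0]["generated_text"])
```

The only difference between the two calls is the handful of worked examples embedded in the prompt.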
What is fine-tuning?
In deep learning, fine-tuning is an approach to transfer learning in
which the weights of a pre-trained model are trained on new data.
https://en.wikipedia.org/wiki/Fine-tuning_(deep_learning)
6
When do you need to fine-tune the model?
• Prompt engineering did not work out.
• Retrieval augmented generation (RAG) didn’t work out.
• High-quality training data is available.
• Cost is not a problem.
• It is clear how to measure the result.
• Read more:
• https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/fine-tuning-considerations
• https://platform.openai.com/docs/guides/fine-tuning/when-to-use-fine-tuning
7
Difference between pretraining and fine-tuning 1/2
|               | Pre-training                | Fine-tuning     |
| Training time | Weeks                       | Hours           |
| Compute       | Thousands of GPUs           | One or few GPUs |
| Dataset       | Terabytes (e.g., C4, Pile)  | 100-1000 MB     |
| Budget        | $ Millions                  | $ Hundreds      |
8
(Diagram: LLM → Pretraining with huge datasets and a lot of compute → Pre-trained LLM → Fine-tuning with a small dataset and one or few GPUs)
Difference between pretraining and fine-tuning 2/2
Source: https://blog.research.google/2020/02/exploring-transfer-learning-with-t5.html
(Figure from the T5 blog post: pre-training uses terabytes of input data with unsupervised learning; fine-tuning uses thousands of examples with supervised learning)
9
Fine-tuning methods
• Full fine-tuning continues the initial training of the model from the existing checkpoint.
• PEFT (Parameter-Efficient Fine-Tuning) methods fine-tune only a small number of (extra) model parameters, significantly decreasing computational and storage costs while yielding performance comparable to a fully fine-tuned model.
• LoRA trains large models efficiently by inserting smaller trainable matrices (typically in the attention blocks) that are learned during fine-tuning (see the LoRA sketch below).
• Prompt-based methods (p-tuning, prefix tuning, prompt tuning). Instead of manually creating hard (text) prompts, soft prompting methods add learnable parameters to the input embeddings that are optimized for a specific task while keeping the pre-trained model's parameters frozen.
10
Read more: https://huggingface.co/docs/peft/index
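As an illustration of the PEFT/LoRA idea above, here is a minimal sketch using the Hugging Face peft library with a small T5 checkpoint; the rank, alpha, and target modules are illustrative hyperparameters, not the values used in any particular project.

```python
# Minimal LoRA sketch with Hugging Face PEFT on a small seq2seq (T5-style) model.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # placeholder checkpoint

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # rank of the trainable low-rank matrices (illustrative)
    lora_alpha=32,              # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections in T5 blocks
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```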
Pre-trained open-source LLMs
• Consider licenses that allow commercial use cases.
• A larger LLM has greater capabilities, but it also requires more computing resources.
• A larger context window allows adding more information to the context.
• Some of the most attractive models:
• Mistral: 7B params, 4096 tokens and 16K sliding window, Apache License 2.0
• Gemma: 7B params, 8192 tokens, Google's Gemma Terms of Use
11
Read more: https://github.com/eugeneyan/open-llms
Libraries for fine-tuning
| Library name | Company | Popularity ⭐ | PEFT | DL Framework | Supported LLM models | Links |
| DeepSpeed | Microsoft | 31.5k | ✅ | PyTorch | A lot | docs, github |
| PEFT | HuggingFace 🤗 | 12.7k | ✅ | PyTorch | LLaMA, Mistral, T5, GPT, others | blog, github, docs |
| Accelerate | HuggingFace 🤗 | 6.6k | ✖️ | PyTorch | A lot | github, docs |
| NeMo | Nvidia | 9.4k | ✅ | PyTorch | LLaMA, Falcon, T5, GPT, others | docs, github |
| T5X | Google | 2.3k | ❔ | JAX | T5 and some others, PaLM* | paper, github, docs |
| Paxml | Google | 0.3k | ❔ | JAX | PaLM-2* | docs, github |
12
Supervised fine-tuning in clouds
13
| Cloud | LLM models |
| Azure | GPT, Llama |
| AWS Bedrock | Amazon Titan, Anthropic Claude, Cohere Command, Meta Llama [link] |
| GCP Vertex AI* | PaLM 🌴, Gemma, T5, Gemini**, Llama |
| OpenAI Platform | GPT |
| Anthropic | Claude |
| Cohere | Command |
| MosaicML | MPT |
* - supports RLHF
** - coming soon
Hardware for industrial needs
14
• Nvidia GPU:
• H100 with up to 80GB RAM
• Supports any framework
• Available in any cloud
• Requires NVLink/NVSwitch for efficient data/model parallelism
• On-prem possibility
• Google TPU:
• More cost-efficient
• v3-8 with up to 128GB RAM
• Supports XLA only: JAX, PyTorch/XLA, TF
• GCP lock-in
• Supports data/model parallelism out of the box (see the device-check sketch below)
Read more: https://khairy2011.medium.com/tpu-vs-gpu-vs-cerebras-vs-graphcore-a-fair-comparison-between-ml-hardware-3f5a19d89e38
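A quick, hedged sketch of checking which accelerators JAX can see before launching a training run; the printed output differs per environment (for example, a TPU v3-8 slice exposes 8 cores).

```python
# Inspect the accelerators visible to JAX (works for both TPU and GPU backends).
import jax

print(f"Backend: {jax.default_backend()}, device count: {jax.device_count()}")
for device in jax.devices():
    print(device)  # e.g., TpuDevice(...) entries on a TPU v3-8 slice
```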
Other important topics
• Inference performance optimization by reducing the memory footprint and improving parallelizability:
• Efficient attention using lower-level, hardware-aware optimizations (e.g., Flash Attention)
• Quantization, which reduces the computational precision of weights and activations (see the quantization sketch below)
• Mixture of Experts, which decreases inference time by not using all experts at once
• and others
• Solutions for hallucinations:
• Retrieval-augmented generation (RAG)
• Solutions for misleading behavior:
• Reinforcement Learning from Human Feedback (RLHF)
Kaddour, J., Harris, J., Mozes, M., Bradley, H., Raileanu, R. and McHardy, R., 2023. Challenges and applications
of large language models. arXiv preprint arXiv:2307.10169. 15
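As a hedged illustration of the quantization point above, the following sketch loads a causal LM with 4-bit weights via bitsandbytes in transformers; the checkpoint name is a placeholder and a CUDA GPU is assumed.

```python
# Load a causal LM with 4-bit weight quantization (bitsandbytes via transformers).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # placeholder checkpoint
    quantization_config=bnb_cfg,
    device_map="auto",             # requires a CUDA GPU
)
```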
Data-to-Text generation
16
Data-to-Text generation
17
{
"city": "San Francisco",
"date": "2024-02-28",
"temperature": {
"high": 68,
"low": 51
},
"conditions": "Sunny",
"wind_speed": 5,
"humidity": 72
}
→ LLM →
Today's weather report for San Francisco on February 28, 2024, indicates a sunny day with a high of 68°F and a low of 51°F. Expect mild wind speeds around 5 mph and a humidity level of 72%. It's a beautiful day to be outdoors!
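A minimal sketch of the data-to-text flow shown above: the structured record is serialized and passed to a seq2seq model that verbalizes it. The t5-small checkpoint is only a stand-in for an actual fine-tuned model.

```python
# Data-to-text inference sketch: structured record in, natural-language text out.
import json
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

record = {
    "city": "San Francisco",
    "date": "2024-02-28",
    "temperature": {"high": 68, "low": 51},
    "conditions": "Sunny",
    "wind_speed": 5,
    "humidity": 72,
}

tokenizer = AutoTokenizer.from_pretrained("t5-small")      # placeholder checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # placeholder checkpoint

inputs = tokenizer(json.dumps(record), return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```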
Our setup
• Model: T5 3B (encoder-decoder architecture)
• Library: T5X
• Hardware: a single TPU v3-8 with 128GB RAM (enough to fit T5 3B into a single TPU unit)
• Cloud: GCP (we can migrate model training to GPUs)
• Dataset: 300,000 examples for training and 50,000 for A/B testing
• Supervised full fine-tuning
18
Why T5 in 2021/2022?
• GPT-2/J/Neo vs. T5.
• T5X is a mature fine-tuning library.
• Large models up to 11B params.
• Cost-efficient training using TPUs and JAX.
• Apache 2.0 license.
• Llama and most of today's (2024) popular open models were released later.
19
(Figure: GPT-2 vs. T5)
Training task definition
20
Input:
City: San Francisco
Date: 2024-02-28
Temperature: 68, 51
Conditions: Sunny
Wind speed: 5
Humidity: 72

→ T5 3B →

Target:
Today's weather report for San Francisco on February 28, 2024, indicates a sunny day with a high of 68°F and a low of 51°F. Expect mild wind speeds around 5 mph and a humidity level of 72%. It's a beautiful day to be outdoors!

300,000 training pairs of input and output
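A hedged sketch of how such input/target training pairs could be assembled; the record_to_prompt helper and the sample data are illustrative, not the project's actual preprocessing code.

```python
# Build (input, target) training pairs in the key/value prompt format shown above.
def record_to_prompt(record: dict) -> str:
    """Flatten a structured record into the line-per-field input format."""
    return "\n".join(
        [
            f"City: {record['city']}",
            f"Date: {record['date']}",
            f"Temperature: {record['temperature']['high']}, {record['temperature']['low']}",
            f"Conditions: {record['conditions']}",
            f"Wind speed: {record['wind_speed']}",
            f"Humidity: {record['humidity']}",
        ]
    )

records = [  # placeholder structured records
    {
        "city": "San Francisco",
        "date": "2024-02-28",
        "temperature": {"high": 68, "low": 51},
        "conditions": "Sunny",
        "wind_speed": 5,
        "humidity": 72,
    },
]
reference_texts = [  # placeholder human-written descriptions aligned with `records`
    "Today's weather report for San Francisco on February 28, 2024, "
    "indicates a sunny day with a high of 68°F and a low of 51°F.",
]

train_pairs = [
    {"input": record_to_prompt(r), "target": t}
    for r, t in zip(records, reference_texts)
]
print(train_pairs[0]["input"])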
Problems with smaller models
• Problems (in both the smallest GPT and the smallest T5, 60M params):
• Not natural-sounding enough
• Too short
• Texts have “errors” (e.g., wrong number values)
• The model used “forbidden phrases”
• “Messy” text: e.g., the same phrase repeated twice within one text, sometimes in a row; incomplete sentences or brief expressions without context at the end of the text, etc.
21
Solution
• Use the larger T5 3B model.
• Additional manipulations with the input data:
• Cleaned up the text by removing irrelevant information.
• Supplied “forbidden phrases” as inputs to prevent the model from using them liberally.
• Provided explicit labels in the input prompt:
• “Low quality” – helps the model distinguish text quality.
• “Too short” – helps the model distinguish between short and long descriptions.
• Text clustering using Locality-Sensitive Hashing (LSH) to find similar texts (see the LSH sketch below).
22
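For the LSH clustering step mentioned above, here is a minimal sketch using the datasketch library's MinHash LSH; the similarity threshold and word-level shingling are illustrative choices.

```python
# Find near-duplicate texts with MinHash LSH (datasketch library).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):  # word shingles; char n-grams also work
        m.update(token.encode("utf8"))
    return m

texts = {
    "a": "Sunny day with a high of 68 and a low of 51 in San Francisco.",
    "b": "A sunny day in San Francisco with a high of 68 and a low of 51.",
    "c": "Heavy rain expected in Seattle with strong winds all day.",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # threshold is illustrative
hashes = {key: minhash(t) for key, t in texts.items()}
for key, m in hashes.items():
    lsh.insert(key, m)

print(lsh.query(hashes["a"]))  # keys of texts similar to "a", e.g. ['a', 'b']
```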
Regularize the model via prompting
23

Training (the input ends with a “Less formal” label):
City: San Francisco
Date: 2024-02-28
Temperature: 68, 51
Conditions: Sunny
Wind speed: 5
Humidity: 72
Less formal
→ LLM →
Today's weather report for San Francisco on February 28, 2024, indicates a sunny day with a high of 68°F and a low of 51°F. Expect mild wind speeds around 5 mph and a humidity level of 72%. It's a beautiful day to be outdoors!

Generation (no label in the input):
City: San Francisco
Date: 2024-02-28
Temperature: 68, 51
Conditions: Sunny
Wind speed: 5
Humidity: 72
→ LLM →
Today's weather report for San Francisco on February 28, 2024, indicates a sunny day with a high of 68°F and a low of 51°F. Expect mild wind speeds around 5 mph and a humidity level of 72%.

The model learns the relationship between the input prompt and the required generated text.
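A small sketch of the labelling trick illustrated above: control labels are appended to the input during training and simply omitted at generation time. The build_input helper and the label names are illustrative.

```python
# Append control labels (e.g., "Less formal", "Too short", "Low quality") to
# the training input; omit them at generation time to steer the model away
# from the labelled behaviour.
def build_input(prompt: str, labels: list[str] | None = None) -> str:
    """Append control labels, one per line, after the structured prompt."""
    return prompt + "".join(f"\n{label}" for label in (labels or []))

prompt = "City: San Francisco\nDate: 2024-02-28\nConditions: Sunny"

# Training: the reference text is informal, so the "Less formal" label is added.
train_input = build_input(prompt, labels=["Less formal"])

# Generation: no label, so the model is steered towards the default (formal) style.
inference_input = build_input(prompt)

print(train_input)
print("---")
print(inference_input)
```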
Model evaluation by size
24
(Chart 1: share of texts without “forbidden words”, T5 Small vs. T5 XL; higher is better, y-axis from 70% to 95%)
(Chart 2: length distribution of the generated texts in words, number of examples per 10-word bucket from 0 to 210+, T5 Small vs. T5 XL)
Results evaluation
• Custom metrics (see the metrics sketch below).
• Example: the number of occurrences in the generated text of:
• forbidden words
• unseen numbers (numbers not present in the input)
• Expert verification.
• Experts reviewed the generated texts.
• A set of known examples was monitored.
• A/B testing validation.
25
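A minimal sketch of the custom checks mentioned above: counting forbidden phrases and flagging numbers in the output that never occur in the input. The phrase list, regex, and sample texts are illustrative.

```python
# Custom checks: forbidden-phrase count and "unseen numbers" in generated text.
import re

FORBIDDEN = {"best ever", "guaranteed"}  # illustrative forbidden phrases

def forbidden_phrase_count(text: str) -> int:
    lowered = text.lower()
    return sum(lowered.count(phrase) for phrase in FORBIDDEN)

def unseen_numbers(input_text: str, generated_text: str) -> set[str]:
    """Numbers mentioned in the output that never occur in the input."""
    nums_in = set(re.findall(r"\d+(?:\.\d+)?", input_text))
    nums_out = set(re.findall(r"\d+(?:\.\d+)?", generated_text))
    return nums_out - nums_in

source = "Temperature: 68, 51\nWind speed: 5\nHumidity: 72"
generated = "A sunny day with a high of 68°F, a low of 51°F and winds of 7 mph."

print(forbidden_phrase_count(generated))  # 0
print(unseen_numbers(source, generated))  # {'7'} - the wind speed was hallucinated
print(len(generated.split()))             # word count, used for the length distribution
```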
Conclusions
• While getting into the LLM field has a very steep learning curve, collecting datasets is the most important and time-consuming part.
• Larger models provide better results (e.g., 1B vs. 3B). However, larger models require more ML engineering effort.
• Fine-tuning requires very high-quality data.
26
Thank you!
27
Model size
28
(Figure: comparison of model sizes, with a “We are here” marker)
Anthropic’s RLHF dataset example
29
Source: https://huggingface.co/datasets/Anthropic/hh-rlhf?row=0
Different usages of the models
• Prediction using zero-shot learning
• No need for additional training
• Very large models are required (>=100 billion parameters)
• Not-so-precise predictions
• Only simple prompt engineering is required
• It is almost impossible to influence the quality of the model's generated text
• Few examples are required
• Prediction using a fine-tuned model
• Requires additional model training (might take from hours to days)
• Not-so-large models are required (<11 billion parameters)
• More precise predictions
• Advanced knowledge is required to fine-tune the model
• Generated text quality is adjustable by improving the fine-tuning dataset
• >=10,000 examples are required
30
Using fine-tuned models vs. zero-shot learning

Using a fine-tuned model (challenging, better generated text quality):
choose a cloud or on-prem pre-trained model → fine-tuning → use the model for prediction

Using a model with zero-shot learning (piece of cake, not-so-precise generated text):
choose a cloud or on-prem pre-trained model → prompt engineering → use the model for prediction
31