focusing on the hands-on process of preparing datasets and fine-tuning models for a specific business task. This session will cover dataset preparation, model fine-tuning, and cloud ML accelerators like TPUs and related libraries. It’s aimed at those seeking hands-on knowledge in applying ML techniques.
2. • Lead ML Engineer @ Ecentria Group / Intexsys
• I'm working on a large US e-commerce project
• M.Sc. in Computer Science @ TSI
• 10+ years of experience in Software Engineering
3. Agenda
• LLM fine-tuning introduction
• Data-to-text generation for specific business case
• Conclusions
• Q&A
5. Zero/One/Few-shot learning
• Zero-shot learning
• Means providing a prompt that isn't part of the training data
• Example: asking the model open-ended questions
• One/Few-shot learning
• Provide one or a few examples as part of the prompt
• Example: asking the model to format text and providing a few examples
• Prompt engineering
Q: What is the title of this section?
A: Introduction to LLM fine-tuning
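The zero- vs. few-shot distinction above can be sketched as plain prompt templates (no specific model API is assumed; the formatting task and example cities are hypothetical):

```python
# Zero-shot: the task is stated directly, with no examples.
zero_shot = "What is the title of this section?"

# Few-shot: a handful of input/output pairs demonstrate the desired format
# in-context, before the actual input the model should complete.
few_shot = """Format the city name as 'City, COUNTRY'.

Input: paris france
Output: Paris, FRANCE

Input: riga latvia
Output: Riga, LATVIA

Input: san francisco usa
Output:"""

# Both prompts would be sent to the model as-is; no weights are updated.
print(zero_shot)
print(few_shot)
```
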
6. What is fine-tuning?
In deep learning, fine-tuning is an approach to transfer learning in
which the weights of a pre-trained model are trained on new data.
https://en.wikipedia.org/wiki/Fine-tuning_(deep_learning)
7. When do you need to fine-tune the model?
• Prompt engineering did not work out.
• Retrieval-augmented generation (RAG) did not work out.
• High-quality training data is available.
• Cost is not a problem.
• It is clear how to measure the result.
• Read more:
• https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/fine-tuning-considerations
• https://platform.openai.com/docs/guides/fine-tuning/when-to-use-fine-tuning
8. Difference between pretraining and fine-tuning 1/2
                Pre-training                 Fine-tuning
Training time   Weeks                        Hours
Compute         Thousands of GPUs            One or a few GPUs
Dataset         Terabytes (e.g., C4, Pile)   100-1000 MB
Budget          Millions of $                Hundreds of $
Diagram: LLM → pre-training (huge datasets and a lot of compute) → pre-trained LLM → fine-tuning (small dataset and one or a few GPUs).
9. Difference between pretraining and fine-tuning 2/2
Source: https://blog.research.google/2020/02/exploring-transfer-learning-with-t5.html
Figure: pre-training uses terabytes of input data (unsupervised learning); fine-tuning uses thousands of examples (supervised learning).
10. Fine-tuning methods
• Full fine-tuning continues the initial training of the model using the
existing checkpoint.
• PEFT (Parameter-Efficient Fine-Tuning) methods only fine-tune a small
number of (extra) model parameters - significantly decreasing
computational and storage costs - while yielding performance comparable
to a fully fine-tuned model.
• LoRA trains large models efficiently by inserting smaller trainable matrices (typically into the attention blocks) that are learned during fine-tuning.
• Prompt-based methods (p-tuning, prefix tuning, prompt tuning). Instead of manually
creating hard (text) prompts, soft prompting methods are applied by adding
learnable parameters to the input embeddings that can be optimized for a specific
task while keeping the pre-trained model’s parameters frozen.
Read more: https://huggingface.co/docs/peft/index
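The parameter savings behind LoRA can be illustrated with a minimal NumPy sketch (plain NumPy for illustration, not the PEFT library's actual implementation):

```python
import numpy as np

# Instead of updating a frozen d x d weight W, LoRA trains two small matrices
# A (r x d) and B (d x r); the effective weight is W + (alpha / r) * B @ A.
d, r, alpha = 512, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init

# Because B starts at zero, the model initially behaves exactly like the
# pre-trained one; fine-tuning only nudges it through the low-rank update.
W_eff = W + (alpha / r) * (B @ A)

full_params = d * d                  # parameters in a full fine-tune of W
lora_params = A.size + B.size        # parameters LoRA actually trains
print(full_params, lora_params)      # 262144 vs. 8192 trainable parameters
```

With rank r = 8, the trainable parameter count drops by a factor of 32 for this layer, which is where the "significantly decreasing computational and storage costs" claim comes from.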
11. Pre-trained open-source LLMs
• Consider licenses that allow commercial use cases.
• A larger LLM has greater capabilities, but it also requires higher
computing resources.
• A larger context window allows adding more information into context.
• Some of the most attractive models:
• Mistral: 7B params, 4096 tokens and 16K sliding window, Apache License 2.0
• Gemma: 7B params, 8192 tokens, Google's Gemma Terms of Use
Read more: https://github.com/eugeneyan/open-llms
12. Libraries for fine-tuning
• DeepSpeed (Microsoft, 31.5k ⭐): PEFT ✅, PyTorch; many models; docs, github
• PEFT (HuggingFace 🤗, 12.7k ⭐): PEFT ✅, PyTorch; LLaMA, Mistral, T5, GPT, others; blog, github, docs
• Accelerate (HuggingFace 🤗, 6.6k ⭐): PEFT ✖️, PyTorch; many models; github, docs
• NeMo (Nvidia, 9.4k ⭐): PEFT ✅, PyTorch; LLaMA, Falcon, T5, GPT, others; docs, github
• T5X (Google, 2.3k ⭐): PEFT ❔, JAX; T5 and some others, PaLM*; paper, github, docs
• Paxml (Google, 0.3k ⭐): PEFT ❔, JAX; PaLM-2*; docs, github
14. Hardware for industrial needs
• Nvidia GPU:
• H100 with up to 80GB RAM
• Supports any framework
• Available in any cloud
• Requires NVLink/NVSwitch for efficient data/model parallelism
• On-prem possibility
• Google TPU:
• More cost-efficient
• v3-8 with up to 128GB RAM
• Supports XLA only: JAX, PyTorch/XLA, TF
• GCP lock-in
• Supports data/model parallelism out-of-the-box
Read more: https://khairy2011.medium.com/tpu-vs-gpu-vs-cerebras-vs-graphcore-a-fair-comparison-between-ml-hardware-3f5a19d89e38
15. Other important topics
• Inference performance optimization by reducing memory footprint and
improving parallelizability.
• Efficient attention using lower-level hardware-aware optimizations (e.g., FlashAttention)
• Quantization by reducing the computational precision of weights and activations
• Mixture of Experts to decrease inference time by not using all experts at once.
• and others.
• Mitigating hallucinations
• Retrieval-augmented generation (RAG)
• Mitigating misleading behavior
• Reinforcement Learning From Human Feedback (RLHF)
Kaddour, J., Harris, J., Mozes, M., Bradley, H., Raileanu, R. and McHardy, R., 2023. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169.
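The quantization bullet above can be illustrated with a minimal symmetric int8 scheme (a sketch only; production systems typically use per-channel or more elaborate schemes):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 by scaling the largest magnitude to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)  # 8-bit storage: 4x smaller than fp32
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.dtype, float(err))  # int8, with a small per-weight reconstruction error
```

The trade-off is exactly the one named in the bullet: lower precision shrinks memory footprint and speeds up inference, at the cost of a bounded rounding error per weight.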
17. Data-to-Text generation
{
"city": "San Francisco",
"date": "2024-02-28",
"temperature": {
"high": 68,
"low": 51
},
"conditions": "Sunny",
"wind_speed": 5,
"humidity": 72
}
→ LLM →

Today's weather report for San Francisco on February 28, 2024, indicates a sunny day with a high of 68°F and a low of 51°F. Expect mild wind speeds around 5 mph and a humidity level of 72%. It's a beautiful day to be outdoors!
18. Our setup
• Model: T5 3B (encoder-decoder architecture)
• Library: T5X
• Hardware: a single TPU v3-8 with 128GB (enough to fit T5 3B into a single TPU unit)
• Cloud: GCP (we can migrate model training to GPUs)
• Dataset: 300,000 examples for training and 50,000 for A/B testing
• Supervised full fine-tuning
19. Why T5 in 2021/2022?
• GPT-2/J/Neo vs T5.
• T5X is a mature fine-tuning library.
• Large models up to 11B.
• Cost-efficient training using
TPUs and JAX.
• Apache 2.0 license.
• Llama and most other popular open models were released later (as of today, 2024).
(Figure: GPT-2 vs. T5 architectures.)
20. Training task definition
City: San Francisco
Date: 2024-02-28
Temperature: 68, 51
Conditions: Sunny
Wind speed: 5
Humidity: 72

→ T5 3B →

Today's weather report for San Francisco on February 28, 2024, indicates a sunny day with a high of 68°F and a low of 51°F. Expect mild wind speeds around 5 mph and a humidity level of 72%. It's a beautiful day to be outdoors!

300,000 training pairs of input and output
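A minimal sketch of how a JSON record like the one on slide 17 might be linearized into the key-value input format shown above (the function name and exact formatting are assumptions, not the project's actual code):

```python
import json

def linearize(record: dict) -> str:
    """Flatten a weather JSON record into 'Key: value' lines for the T5 input."""
    lines = [
        f"City: {record['city']}",
        f"Date: {record['date']}",
        f"Temperature: {record['temperature']['high']}, {record['temperature']['low']}",
        f"Conditions: {record['conditions']}",
        f"Wind speed: {record['wind_speed']}",
        f"Humidity: {record['humidity']}",
    ]
    return "\n".join(lines)

record = json.loads("""{
  "city": "San Francisco", "date": "2024-02-28",
  "temperature": {"high": 68, "low": 51},
  "conditions": "Sunny", "wind_speed": 5, "humidity": 72
}""")
print(linearize(record))
```

Each of the 300,000 training pairs would then be one such linearized input together with its human-written target text.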
21. Problems with smaller models
• Problems (in both the smallest-size GPT and the smallest-size T5, 60M params):
• Not natural-sounding enough
• Too short
• Texts have "errors" (e.g., wrong number values)
• The model used "forbidden phrases"
• "Messy" text: e.g., the same phrases repeated twice within the same text, sometimes in a row; incomplete sentences/brief expressions without context at the end of the text, etc.
22. Solution
• Use a larger model: T5 3B.
• Additional manipulations with the input data:
• Cleaned up the text by removing irrelevant information.
• Supplied "forbidden phrases" as inputs so the model learns to avoid them.
• Provided explicit labels in the input prompt:
• "Low quality" helps the model distinguish text quality.
• "Too short" helps the model distinguish between short and long descriptions.
• Text clustering using locality-sensitive hashing (LSH) to find similar texts.
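The LSH-based clustering mentioned above can be sketched with MinHash signatures, the standard building block of LSH for text similarity (a pure-Python illustration; the project's actual implementation is not shown):

```python
import hashlib

def shingles(text: str, k: int = 4) -> set:
    """Overlapping character k-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text: str, num_hashes: int = 64) -> list:
    # One minimum per seeded hash function; similar texts share many minima.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_hashes)
    ]

def similarity(a: str, b: str) -> float:
    """Fraction of matching signature positions, approximating Jaccard similarity."""
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

# Near-duplicate descriptions score high; unrelated texts score near zero.
print(similarity("sunny day with a high of 68",
                 "sunny day with a high of 70"))
```

In a full LSH setup the signatures would additionally be banded into hash buckets so that near-duplicates can be found without comparing every pair.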
23. Regularize the model via prompting
Training (the input includes a "Less formal" label):

City: San Francisco
Date: 2024-02-28
Temperature: 68, 51
Conditions: Sunny
Wind speed: 5
Humidity: 72
Less formal

→ LLM → Today's weather report for San Francisco on February 28, 2024, indicates a sunny day with a high of 68°F and a low of 51°F. Expect mild wind speeds around 5 mph and a humidity level of 72%. It's a beautiful day to be outdoors!

Generation (no label):

City: San Francisco
Date: 2024-02-28
Temperature: 68, 51
Conditions: Sunny
Wind speed: 5
Humidity: 72

→ LLM → Today's weather report for San Francisco on February 28, 2024, indicates a sunny day with a high of 68°F and a low of 51°F. Expect mild wind speeds around 5 mph and a humidity level of 72%.
Model learns the relationship between the input prompt and the required generated text.
25. Results evaluation
• Custom metrics.
• Example: the number of occurrences in the generated text of:
• forbidden words
• unseen numbers
• Expert verification.
• Experts reviewed generated texts.
• Set of known examples that were monitored.
• A/B testing validation.
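The custom checks above can be sketched as simple string scans (function names and the exact tokenization are assumptions, not the project's actual metrics code):

```python
import re

def forbidden_word_count(text: str, forbidden: set) -> int:
    """Count occurrences of forbidden words in the generated text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in forbidden for w in words)

def unseen_numbers(input_text: str, output_text: str) -> set:
    """Numbers in the output that never appear in the input (likely hallucinated)."""
    nums_in = set(re.findall(r"\d+(?:\.\d+)?", input_text))
    nums_out = set(re.findall(r"\d+(?:\.\d+)?", output_text))
    return nums_out - nums_in

inp = "Temperature: 68, 51\nWind speed: 5\nHumidity: 72"
out = "A sunny day with a high of 68°F, winds of 5 mph and humidity of 75%."
print(forbidden_word_count(out, {"amazing", "unbeatable"}))
print(unseen_numbers(inp, out))  # flags '75' as a value not present in the input
```

Running such checks over a held-out set gives a cheap automatic signal before the more expensive expert review and A/B testing.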
26. Conclusions
• While getting into the LLM field has a very steep learning curve, collecting datasets is the most important and time-consuming part.
• Larger models provide better results (e.g., 1B vs. 3B). However, larger models require more ML engineering effort.
• Fine-tuning requires very high-quality data.
29. Anthropic’s RLHF dataset example
Source: https://huggingface.co/datasets/Anthropic/hh-rlhf?row=0
30. Different usages of the models
• Prediction using zero-shot learning
• No need for additional training
• Very large models are required (≥100 billion parameters)
• Not-so-precise predictions
• Simple: only prompt engineering is required
• It is almost impossible to influence the quality of the model's generated text
• Only a few examples are required
• Prediction using a fine-tuned model
• Requires additional training of the model (might take hours to days)
• Not-so-large models are required (<11 billion parameters)
• More precise predictions
• Advanced knowledge is required to fine-tune the model
• Generated-text quality is adjustable by improving the fine-tuning dataset
• ≥10,000 examples are required
31. Using fine-tuned models
Choose a cloud or on-prem pre-trained model → Fine-tuning (challenging) → Use the model for prediction: better generated-text quality.
Using models with zero-shot learning:
Choose a cloud or on-prem pre-trained model → Prompt engineering (a piece of cake) → Use the model for prediction: not-so-precise generated text.