Fine-tuning Llama 2: An overview
leewayhertz.com/fine-tuning-llama2/
In the dynamic and ever-evolving field of generative AI, a profound sense of competition has taken root,
fueled by a relentless quest for innovation and excellence. The introduction of GPT by OpenAI has
prompted various businesses to work on creating their own Large Language Models (LLMs). However,
creating such sophisticated models is like navigating a maze of complexities. It demands exhaustive
research, massive amounts of relevant data, and the resolution of numerous other challenges.
Furthermore, the substantial computational power required for these tasks remains a significant hurdle for
many.
Amidst this fiercely competitive landscape, where industry heavyweights like OpenAI and Google have
already etched their indelible marks, a new contender, Meta, entered the arena with their open-source
LLM, Llama, with a goal of democratizing AI. They subsequently upgraded it to Llama 2, which was
trained on 40% more data than its predecessor. While large language models exhibit remarkable general
capability, their ability to handle domain-specific inquiries, such as those related to a business’s
financial performance or inventory status, may be constrained. To equip these models with domain-
specific competence and elevate their precision, a refinement process called fine-tuning is applied.
In this article, we will talk about fine-tuning Llama 2, a model that has opened up new avenues for
innovation, research, and commercial applications. This process of fine-tuning may be considered
imperative as it can yield numerous benefits like cost savings, secure management of confidential data,
and the potential to surpass renowned models like GPT-4 in specialized tasks.
So, let’s dive deeper into the article and explore the transformative power of Llama 2 in redefining the
boundaries of artificial intelligence, creating endless possibilities for businesses.
What is Llama 2?
Why use Llama 2?
Why does Llama 2 matter in the AI landscape?
How does Llama 2 work?
A thorough analysis of Llama 2 in comparison to other leading LLMs
What does fine-tuning an LLM mean?
Techniques for LLM fine-tuning
How can we perform fine-tuning on Llama 2?
PEFT approaches – LoRA and QLoRA
Fine-tuning the Llama 2 model with QLoRA
Challenges in fine-tuning Llama 2
How does LeewayHertz help in building Llama 2 model-powered solutions?
What is Llama 2?
Meta’s recent unveiling of the Llama 2 suite signifies an important milestone in the evolution of LLMs.
Launched in mid-July 2023, Llama 2 emerges as a versatile series of both pre-trained and fine-tuned models,
characterized by its diverse parameter configurations of 7B, 13B, and 70B. This release included
comprehensive papers detailing the intricacies of its design, training, and implementation, offering
invaluable insights into the advancements made in the AI sector.
At the core of Llama 2’s development was an expansive training regimen built upon a staggering 2 trillion
tokens—marking a 40% increase over its predecessor. This rigorous training is complemented by
architectural refinements such as the grouped-query attention (GQA) mechanism. In the 70B model in
particular, GQA expedites inference, ensuring optimal performance without compromising speed.
Furthermore, the model boasts a default context window of 4096 tokens, a significant advancement from
previous iterations and a testament to its enhanced capability to handle complex contextual information.
Architecturally, Llama 2 distinguishes itself from its peers through several innovative attributes. It
leverages RMSNorm normalization, the SwiGLU activation function, and rotary positional embeddings
(RoPE) to further enhance its data processing prowess. The use of the AdamW optimizer with a cosine
learning rate schedule, a weight decay of 0.1, and gradient clipping underscores Meta’s commitment to
refining even the most nuanced aspects of model development.
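These architectural details are easy to verify programmatically. As a quick, hedged illustration (assuming the transformers library is installed and a public mirror of the 70B weights, such as the NousResearch repo used later in this article for the 7B variant, is accessible), loading the published configuration exposes the attributes described above:

from transformers import AutoConfig

# Illustrative sketch: inspect Llama 2 70B's published configuration
# (the repo id is an assumption; any Llama 2 checkpoint mirror works)
config = AutoConfig.from_pretrained("NousResearch/Llama-2-70b-hf")

print(config.max_position_embeddings)  # 4096 -- the default context window
print(config.hidden_act)               # "silu" -- the SwiGLU activation family
print(config.rms_norm_eps)             # epsilon used by the RMSNorm layers
print(config.num_attention_heads)      # 64 query heads
print(config.num_key_value_heads)      # 8 shared KV heads -- i.e., grouped-query attention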
Yet, the true innovation of Llama 2 lies not merely in its architectural and training advancements but in its
fine-tuning strategies. Meta has judiciously prioritized quality over quantity in its Supervised Fine-Tuning
(SFT) phase, a decision inspired by numerous studies indicating the superior model performance
achieved through high-quality data. Complementing this is the Reinforcement Learning from Human
Feedback (RLHF) stage, meticulously designed to calibrate the model in line with user preferences.
Using a comparative approach where annotators evaluate model outputs, the RLHF process refines
Llama 2 to accentuate helpfulness and safety in its responses.
Furthermore, Llama 2’s commercial adaptability is evident in its open-source and commercial character,
facilitating ease of use and expansion. It’s not merely a static tool; it’s a dynamic solution optimized for
dialogue use cases, as seen in the Llama-2-chat versions available on the Hugging Face platform. While
the models differ in parameter size, their consistent optimization for both speed and accuracy
underscores their adaptability to diverse operational demands.
Overall, Llama 2, as a member of the Llama family of LLMs, not only aligns with the technical prowess of
contemporaries like GPT-3 and PaLM 2 but also introduces several groundbreaking innovations. Its
optimized transformer architecture, rigorous training, fine-tuning procedures, and open-source
accessibility position it as a formidable asset in the AI landscape, promising a future of more accurate,
efficient, and user-aligned AI solutions.
Why use Llama 2?
In today’s AI-driven landscape, responsibility and accountability take center stage. Meta’s Llama 2 is
evidence of this heightened focus on creating AI solutions that are transparent, accountable, and open to
scrutiny. This section delves into why Llama 2’s approach is pivotal in reshaping our understanding and
expectations of AI models.
Open source: The bedrock of transparency
Most LLMs, such as OpenAI’s GPT-3 and GPT-4, Google’s PaLM and PaLM 2, and Anthropic’s Claude, have
predominantly been closed source. This limited accessibility restricts the broader research community
from fully understanding these models’ intricacies and decision-making processes. Llama 2 stands in
stark contrast. Being open source enables anyone with relevant technical expertise not just to access but
also to dissect, understand, and potentially modify the model. By enabling people to peruse the research
paper detailing Llama 2’s development and training and even download the model for personal or
business use, Meta is championing an era of transparency in AI.
Ensuring safety through red-teaming
Safety in AI is paramount, and Llama 2’s development process reflects this priority. Internal teams and
commissioned third parties generated adversarial prompts through intensive red-teaming exercises to
facilitate model fine-tuning. These rigorous processes are not a one-time effort; they signify Meta’s
ongoing commitment to refining model safety iteratively. The intention is clear: ensuring Llama 2 is robust
against unforeseen challenges.
Transparent reporting: An insight into model evaluation
The research paper reflects Meta’s commitment to transparency, detailing the challenges encountered
during the development of Llama 2. By highlighting known issues and outlining the steps taken to mitigate
them – and those planned for future iterations – Meta is providing an open playbook on the model’s
strengths and areas for improvement.
Empowering developers: “Responsible use guide” and “Acceptable use policy”
With great power comes great responsibility. Acknowledging LLMs’ vast potential and inherent risks, Meta
has devised a “Responsible Use Guide” to steer developers towards best practices in AI development
and safety evaluations. Complementing this is an “Acceptable Use Policy,” which defines boundaries for
ensuring the responsible use of the model.
Engaging the global community
Meta recognizes the collective intelligence of the global community. Through initiatives such as the
Open Innovation AI Research Community, it invites academic researchers to share insights and research
on the responsible development of LLMs. Furthermore, the Llama Impact Challenge is a call to action for
public, non-profit, and for-profit entities to harness Llama 2 in addressing critical global challenges like
environmental conservation and education.
Launch your project with LeewayHertz
We specialize in fine-tuning pre-trained LLMs to ensure they offer domain-specific responses tailored to
your unique business requirements. For the specifics you’re looking for, contact us today!
Learn More
Why does Llama 2 matter in the AI landscape?
The global AI community has long awaited a shift from commercial monopolization towards open-source
research and experimentation. Meta’s Llama 2 heralds this change. By offering an open-source AI, Meta
ensures a credible alternative to closed-source AI. It democratizes AI, allowing other companies to
develop AI-powered applications under their control, bypassing the commercial constraints of tech giants
like Apple, Google, and Amazon.
Llama 2 is not just a technological marvel; it’s a statement on the importance of responsibility,
transparency, and collaboration in AI. It embodies a future where AI development prioritizes societal
benefits, open dialogue, and ethical considerations.
How does Llama 2 work?
Llama 2, a state-of-the-art language model, has been built using sophisticated training techniques to
understand and generate human-like text. To comprehend its operations, one must delve into its data
sources, training methodologies, and potential applications.
Data sources and neural network training
Llama 2’s foundational strength is attributed to its extensive training on a staggering 2 trillion tokens.
These tokens were sourced from publicly accessible repositories, including:
Common Crawl: An expansive archive encompassing billions of web pages.
Wikipedia: The free encyclopedia offering a wealth of knowledge on myriad topics.
Project Gutenberg: A treasure trove of public domain books.
Each token, be it a word or a semantic fragment, empowers Llama 2 to discern the meaning behind
text. For instance, if the model consistently encounters “Apple” and “iPhone” together, it infers the
relationship between these terms and learns to distinguish the company “Apple” from the fruit
“apple.”
Ensuring quality and mitigating bias
Given the vastness and diversity of the internet, training a model solely on such data can inadvertently
introduce biases or produce inappropriate content. Acknowledging this, the developers of Llama 2
incorporated additional training mechanisms:
Reinforcement Learning from Human Feedback (RLHF): This technique involves human testers
who evaluate multiple AI-generated responses. Their feedback is instrumental in guiding the model
towards generating more relevant and appropriate content.
Adaptation for conversational context
Llama 2’s chat versions were meticulously fine-tuned using specific data sets to enhance conversational
prowess. This ensures that when engaged in a dialogue, Llama 2 responds naturally, simulating human
interaction.
Customization and fine-tuning
One of Llama 2’s defining features is its adaptability. Organizations can mold it to resonate with their
unique brand voice. For instance, if a firm wishes to produce summaries reflecting its distinct style, Llama
2 can be trained on numerous examples to achieve this. Similarly, the model can be fine-tuned for
customer support optimization using FAQs and chat logs, allowing it to respond precisely to user queries.
Llama 2’s robustness and adaptability are products of its comprehensive training and fine-tuning
methodologies. Its ability to assimilate vast data, combined with human feedback mechanisms and
customization options, positions it at the forefront of the language model domain.
A thorough analysis of Llama 2 in comparison to other leading LLMs
The advancement of AI, especially in the domain of large language models, has been nothing short of
extraordinary. This is prominently demonstrated by Llama 2, an LLM designed with adaptability in mind to
empower developers and researchers to explore new horizons and create innovative applications. Here,
we explore the outcomes of some experiments carried out to evaluate how Llama 2 compares to giants
like OpenAI’s GPT and Google’s PaLM.
Creative aptitude: Llama 2 was prompted to simulate a sarcasm-laden dialogue on space
exploration; the resultant discourse, although impressive, trailed slightly behind ChatGPT’s.
When compared with Google’s Bard, Llama 2 showcased superior flair. Thus, while ChatGPT
remains the frontrunner in creative engagements, Llama 2 holds a commendable position among
its peers.
Programming capabilities: Llama 2 was pitted against ChatGPT and Bard in a coding challenge.
The task? To develop functional applications ranging from a basic to-do list to a Tetris game.
Although ChatGPT mastered each challenge, Llama 2, akin to Bard, efficiently crafted the to-do list
and an authentication system, stumbling only on the Tetris game.
Mathematical proficiency: Llama 2’s prowess in solving algebraic and logical math problems was
noteworthy, particularly when compared to Bard. However, ChatGPT’s mathematical proficiency
remained unmatched. Remarkably, Llama 2 excelled in certain problems where its predecessors, in
their early stages, had faltered.
Reasoning and commonsense: A facet that remains a challenge for many AI models is
commonsense reasoning. ChatGPT unsurprisingly led the pack. The contest for the second spot
was neck and neck between Bard and Llama 2, with Bard slightly edging ahead.
Llama 2, though an impressive foundational model, still has room for growth compared to certain other
specialized, fine-tuned models on the market. Foundational models like Llama 2 are designed with
versatility and future adaptability at their core, unlike fine-tuned models optimized for domain-specific
expertise. Given its nascent stage and its ‘foundational’ nature, the potential avenues for Llama 2’s
evolution are promising.
What does fine-tuning an LLM mean?
When discussing the fine-tuning of LLMs, it’s crucial to recognize that such practices extend beyond
language models. Fine-tuning can be applied across various machine learning models based on different
use cases.
Machine learning models are trained to identify patterns within given datasets. For instance, a
Convolutional Neural Network (CNN) designed to detect cars in urban areas would be highly proficient in
that domain due to training on relevant images. Yet, when faced with detecting trucks on highways, its
efficacy might decrease due to unfamiliarity with that data distribution. Rather than starting from scratch
with a new training dataset, fine-tuning allows for adjustments to be made to the model to accommodate
new data types.
Several advanced LLMs are available, including GPT-3, Bloom, BERT, T5, and XLNet. GPT-3, for
instance, is a premium model recognized for its 175 billion parameters, making it adept
at various natural language processing tasks. BERT, conversely, is a more accessible open-source
model excelling in understanding contextual word relationships. The choice between models like GPT-3
and BERT largely depends on the specific task at hand, be it text generation or text classification.
Techniques for LLM fine-tuning
The process of fine-tuning LLMs is intricate, with varying techniques ideal for specific applications.
Sometimes, the goal is to train a model to suit a novel task.
Imagine having a pre-trained LLM skilled in text generation, but you want it to perform sentiment analysis.
This will entail remodeling the model with subtle architectural tweaks before diving into the fine-tuning
phase.
In such a context, you will primarily harness the numeric vectors called embeddings generated by the
LLM’s transformer component. These embeddings carry detailed features of the given input.
Certain LLMs directly produce these embeddings, whereas others, such as the GPT series, use these
embeddings for token or text generation. During adaptation, the LLM’s embedding layer gets linked to a
classification system, typically a set of fully connected layers translating embeddings into class
probabilities. The emphasis here lies in training the classification segment using model-driven
embeddings.
While the LLM’s attention layers generally remain unchanged—offering computational efficiency—the
classifier requires a supervised learning dataset with text instances and their respective classifications.
The amount of fine-tuning data you need depends on task complexity and classifier design. Yet, some
occasions demand deeper adjustment, requiring you to unfreeze the attention layers for a full fine-tuning run.
It’s worth noting that the cost of such full fine-tuning also scales with model size. Fortunately, there exist
strategies to streamline fine-tuning costs. Let’s delve deeper and explore some prominent fine-tuning
techniques.
Unsupervised versus supervised fine-tuning (SFT)
Sometimes, there’s a need to refresh the LLM’s knowledge base without necessarily changing its
behavior. If, for instance, you intend to adapt the model to medical terminology or a new language,
an expansive, unstructured dataset suffices. The choice is between unsupervised pretraining on
ample unstructured data and supervised fine-tuning on labeled datasets for a specific task. In the
unsupervised case, the goal is to immerse the model in a sea of tokens representative of the new
domain or anticipated input types; leveraging vast unstructured datasets scales well thanks to
unsupervised or self-supervised methodologies. However, there are cases where merely updating
the model’s information reservoir falls short: the LLM’s behavior itself needs an overhaul,
necessitating a supervised fine-tuning (SFT) dataset complete with prompts and expected
outcomes. This method is pivotal for models like ChatGPT, which are designed to be highly
responsive to user directives.
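For illustration, a single (hypothetical) SFT record might pair a prompt with its expected outcome, formatted with the instruction template that Llama 2’s chat variants expect:

# A hypothetical SFT training example: prompt plus expected outcome
sft_example = {
    "prompt": "Summarize our Q3 inventory report in two sentences.",
    "response": "Q3 closing inventory rose 8% quarter-over-quarter, driven by ...",
}

# Wrapped in Llama 2's chat instruction template for training
formatted = f"[INST] {sft_example['prompt']} [/INST] {sft_example['response']}"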
Reinforcement Learning from Human Feedback (RLHF)
To elevate SFT, some practitioners employ reinforcement learning from human feedback, a
complex procedure that, currently, only well-resourced organizations have the capacity to run.
While RLHF techniques vary, they all emphasize human-guided LLM training: human
reviewers assess the model’s outputs for certain prompts, steering the model toward desired
results. Take OpenAI’s ChatGPT as an RLHF benchmark. Human feedback is used to develop a
reward model that mirrors human preferences; the LLM then undergoes reinforcement
learning to optimize its outputs against this reward signal.
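At the heart of the reward-modeling step is a simple pairwise objective: the reward model should score the human-preferred response higher than the rejected one. The sketch below assumes a reward_model that maps a tokenized sequence to a scalar score; it illustrates the standard Bradley–Terry-style loss used in RLHF reward modeling, not any one organization’s exact implementation:

import torch.nn.functional as F

def reward_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # scalar score for the preferred response
    r_rejected = reward_model(rejected_ids)  # scalar score for the rejected response
    # Train the reward model so preferred responses score higher
    return -F.logsigmoid(r_chosen - r_rejected).mean()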
Parameter-efficient Fine-tuning (PEFT)
PEFT, an emerging field within LLM fine-tuning, aims to minimize the resources spent on updating
model parameters by limiting how many parameters are altered. One such method gaining traction
is Low-rank Adaptation (LoRA). The essence of LoRA is that only a small number of parameters
need adjusting for downstream tasks, so a compact matrix can capture task-specific nuances.
Implementing LoRA means training this compact matrix rather than the entire LLM’s parameters.
Once trained, the LoRA weights can either be merged into the primary LLM or applied during
inference. Adopting techniques like LoRA can reduce fine-tuning expenditures considerably while
enabling the storage of numerous fine-tuned adapters ready for integration during LLM operations.
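Using the peft library, the savings are easy to see: wrapping a base model with a LoRA adapter leaves only a tiny fraction of parameters trainable. The hyperparameters below are illustrative, and the printed figures are approximate:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-hf")
lora_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(base, lora_config)

# Prints something like: trainable params: ~4M || all params: ~6.7B || trainable%: ~0.06
model.print_trainable_parameters()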
Reinforcement Learning from AI Feedback (RLAIF)
Fine-tuning a Large Language Model (LLM) using Reinforcement Learning from AI Feedback (RLAIF)
involves a structured process that ensures the model’s behavior aligns with a set of predefined principles
or guidelines, often encapsulated in a Constitution. Here’s an overview of the steps involved in fine-tuning
an LLM using RLAIF:
Define the Constitution
Constitution creation: Begin by defining the Constitution, a document or set of guidelines that
outlines the principles, ethics, and behavioral norms that the AI model should adhere to. This
Constitution will guide the AI Feedback Model in generating preferences.
Set up the AI feedback model
Model selection: Choose or develop an AI feedback model capable of understanding and applying
the principles outlined in the Constitution.
Model training (if necessary): If the AI feedback model isn’t pre-trained, you might need to train it
to interpret the Constitution and evaluate responses based on it. This could involve supervised
learning, using a dataset where responses are annotated based on their alignment with
constitutional principles.
Generate feedback data
Feedback generation: Use the AI feedback model to evaluate pairs of prompt/response instances.
For each pair, the model assigns a preference score, indicating which response aligns better with
the principles in the Constitution.
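A simplified sketch of this step (all names here — judge_pipeline, the constitution text, the verdict parsing — are illustrative assumptions, not a fixed API) could look like this:

CONSTITUTION = "Responses must be helpful, harmless, and honest."

def ai_preference(judge_pipeline, prompt, response_a, response_b):
    # Ask the AI feedback model which response better follows the Constitution
    judge_prompt = (
        f"Constitution: {CONSTITUTION}\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the Constitution? Answer with A or B."
    )
    output = judge_pipeline(judge_prompt, max_new_tokens=1)[0]["generated_text"]
    return "A" if output.strip().endswith("A") else "B"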
Train the Preference Model (PM)
Data preparation: Organize the AI-generated feedback into a dataset suitable for training the
Preference Model (PM).
Preference model training: Train the model on this dataset. It learns to predict the preferred
response to a given prompt based on the feedback scores provided by the AI feedback model.
Fine-tune the LLM
Integration with reinforcement learning: Integrate the trained preference model into a
reinforcement learning framework. In this setup, the preference model provides the reward signal
based on how well a response from the LLM aligns with the constitutional principles.
LLM fine-tuning: Fine-tune the LLM using this reinforcement learning setup. The LLM generates
responses to prompts, and the responses are evaluated by the PM. The LLM then adjusts its
parameters to maximize the reward signal, effectively learning to produce responses that better
align with the constitutional principles.
Evaluation and iteration
Model evaluation: After fine-tuning, evaluate the LLM’s performance to ensure it aligns with the
desired principles and effectively handles a variety of prompts.
Feedback loop: If the performance is not satisfactory or if there’s room for improvement, you might
need to iterate over the process. This could involve refining the Constitution, adjusting the AI
feedback model, retraining the preference model, or further fine-tuning the LLM.
Deployment and monitoring
Deployment: Once the fine-tuning process meets the performance and ethical standards, deploy
the model.
Continuous monitoring: Regularly monitor the model’s performance and behavior to ensure it
continues to align with the constitutional principles, adapting to new data and evolving
requirements.
Fine-tuning an LLM using RLAIF is a complex process that involves careful design, consistent evaluation,
and ongoing adjustment to ensure that the model’s behavior aligns with human values and ethical
standards. It’s a dynamic process that benefits from continuous monitoring and iterative improvement.
How can we perform fine-tuning on Llama 2?
PEFT approaches – LoRA and QLoRA
Parameter-efficient Fine-tuning (PEFT) presents an effective approach to fine-tuning LLMs. Distinct from
traditional methods that mandate extensive parameter updates, PEFT focuses on refining a select subset
of parameters, minimizing computational demands and expediting the training process. By gauging the
significance of individual parameters based on their influence on the overall model, PEFT prioritizes those
with maximal impact. Consequently, only these pivotal parameters undergo adjustments during the fine-
tuning phase, while others remain static. Such a strategy curtails computational and temporal overheads
and paves the way for swift model iteration and deployment. As PEFT emerges as a frontrunner in
optimization techniques, it’s vital to recognize that it remains a dynamic field, with continuous research
ushering in nuanced variations and enhancements. The choice of PEFT application will invariably depend
on specific research goals and practical contexts.
PEFT is an innovative approach that effectively reduces RAM and storage demands. It achieves this by
primarily refining a select set of parameters while maintaining the majority in their original state. PEFT’s
strength lies in its ability to foster robust generalization even when datasets are of limited volume.
Moreover, it augments the model’s reusability and transferability. Small model checkpoints, derived from
PEFT, seamlessly integrate with the foundational model, promoting versatile fine-tuning across diverse
scenarios by incorporating PEFT-specific parameters. A salient feature is the preservation of insights
from the pre-training phase, ensuring the model remains resilient to catastrophic forgetting.
Prominent PEFT strategies emphasize the integrity of the pre-trained base, introducing supplementary
layers or parameters termed “Adapters.” Through a process dubbed “adapter-tuning,” these layers are
integrated with the foundational model, with tuning efforts concentrated on the new layers alone. A
notable challenge with this approach is heightened latency during the inference stage, which can
hamper efficiency in various contexts.
Parameter-efficient fine-tuning has become a pivotal area of focus within AI, and there are myriad
techniques to achieve this. Among these, the Low-rank Adaptation (LoRA) and its enhanced counterpart,
QLoRA, are distinguished for their effectiveness.
Low-rank Adaptation (LoRA)
LoRA introduces an innovative paradigm in model fine-tuning, offering a modular method adept at
domain-specific tasks and transferring learning capabilities. The intrinsic beauty of LoRA lies in its ability
to be executed using minimal resources while being memory-conservative.
A closer examination of the LoRA technique reveals the following steps and intricacies:
Pre-trained parameter preservation: The original neural network’s foundational parameters (W)
remain unaltered during the adaptation process.
Inclusion of new parameters: Alongside this original setup, supplementary weight matrices
(denoted WA and WB) are introduced. These matrices are low-rank: their dimensions (d×r and r×d)
are purposefully small compared to the original network’s. Here, ‘d’ symbolizes the original vector’s
dimension, and ‘r’ denotes the low rank. Notably, a smaller ‘r’ accelerates training, although it may
require a fine balance to maintain optimal performance.
Matrix product calculation: The low-rank matrices are multiplied together (WA·WB) to form an
update of the original dimensions, which is combined with the frozen weights to inform the model’s results.
Loss function computation: The loss function is discerned by contrasting the derived results
against expected outputs. Traditional backpropagation methods are then harnessed to calibrate the
WA and WB weights.
LoRA’s essence is its economical memory footprint and infrastructure demands. For instance, given
a 512×512 parameter matrix in a typical feed-forward network (262,144 parameters), leveraging a
LoRA adapter with a rank of 2 means only 2,048 parameters (512×2 for WA plus 2×512 for WB) undergo
domain-specific training. This streamlined process significantly elevates computational efficiency.
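A minimal PyTorch sketch (assuming the dimensions from the example above) makes this concrete: the frozen weight matrix W is augmented by the trainable low-rank product, and only 2,048 parameters require gradients.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d=512, r=2, alpha=16):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)  # frozen pre-trained weights (262,144 params)
        self.W.weight.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # WA: r x d, trainable
        self.B = nn.Parameter(torch.zeros(d, r))         # WB: d x r, trainable (zero-init)
        self.scale = alpha / r

    def forward(self, x):
        # Original output plus the scaled low-rank update
        return self.W(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2048 = (2 x 512) + (512 x 2)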
An exceptional facet of LoRA is its modular design. The trained adapter can be retained as an
independent entity, serving as a modular component for a specific domain. Furthermore, LoRA adeptly
sidesteps catastrophic forgetting by abstaining from modifying the foundational weights.
Further developments: QLoRA
To further accentuate the effectiveness of LoRA, QLoRA has been introduced as an augmented
technique, promising enhanced optimization and performance. This advanced method builds upon the
foundational principles of LoRA, optimizing it for even more intricate tasks.
QLoRA builds upon LoRA to further optimize efficiency by converting the weight values of the original
network from high-precision formats, like Float32, to more compact 4-bit types such as NF4. This
conversion reduces memory usage and accelerates computation.
QLoRA introduces three primary enhancements over LoRA—4-bit NF4 quantization, double quantization, and paged memory management—establishing it as a leading method in PEFT.
1. 4-bit NF4 quantization
Using the 4-bit NormalFloat (NF4) data type is a strategic move to decrease storage requirements. This
process is divided into three phases:
Normalization & quantization: Here, weights are normalized to a zero mean and a consistent unit
variance. Given that a 4-bit data format can hold just 16 distinct values, each weight is mapped to the
closest of these 16 levels based on its relative position. For example, an FP32 weight of
0.2121 would be stored as its nearest 4-bit level, not the exact value.
Dequantization: This is the reverse process. During computation, the quantized weights are
dequantized back to the compute data type, restoring them to their near-original form.
Double quantization: This phase optimizes memory further. The per-block quantization
constants are themselves grouped and quantized to 8 bits, yielding a significant reduction in overhead. In
essence, for a model with 1 million parameters, the memory demand of the quantization constants
can be slashed to around 127,000 bits.
2. Unified memory paging
Together with the quantization methods, QLoRA leverages NVIDIA’s unified memory capabilities. This
feature facilitates smooth transfers of optimizer states between GPU and CPU memory, which is particularly
useful during memory-intensive operations or unexpected GPU memory spikes, preventing memory overflow.
While both LoRA and QLoRA are at the forefront of PEFT, QLoRA’s advanced techniques offer superior
efficiency and optimization.
Fine-tuning the Llama 2 model with QLoRA
Let’s delve into the process of fine-tuning the Llama 2 model, which features a massive 7 billion
parameters. We will harness the computational power of a T4 GPU, backed by high RAM, available on
Google Colab at a rate of 2.21 credits per hour. It’s worth noting that the T4 comes equipped with 16 GB
of VRAM. Now, when you consider the weight of Llama 2-7b (7 billion parameters equating to 14 GB in
FP16 format), the VRAM is stretched almost to its limit. This scenario doesn’t even factor in additional
overheads such as optimizer states, gradients, and forward activations. The implication is clear:
traditional fine-tuning won’t work here. We need to apply parameter-efficient fine-tuning techniques, such
as LoRA or QLoRA.
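A quick back-of-the-envelope calculation makes the constraint concrete (weights only, excluding optimizer states, gradients, and activations):

params = 7_000_000_000

print(params * 2 / 1e9)    # FP16: 2 bytes per parameter -> ~14 GB, nearly all of a T4's 16 GB
print(params * 0.5 / 1e9)  # 4-bit: 0.5 bytes per parameter -> ~3.5 GB, leaving headroom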
One way to significantly cut down on VRAM usage is by fine-tuning the model using 4-bit precision. This
makes QLoRA an apt choice. Fortunately, the Hugging Face ecosystem is equipped with libraries like
transformers, accelerate, peft, trl, and bitsandbytes to facilitate this. Our step-by-step code is inspired by
the contributions of Younes Belkada on GitHub. We initiate the process by installing and activating these
libraries.
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
Let’s delve into the adjustable parameters in this context. We will begin by loading the llama-2-7b-chat-hf
model, commonly referred to as the chat model. Our aim is to train this model using the dataset
mlabonne/guanaco-llama2-1k, which comprises 1,000 samples. Upon completion, the resulting fine-tuned
model will be termed llama-2-7b-miniguanaco. For those curious about the origin and creation of this
dataset, a detailed notebook is available for review. However, do note that customization is possible. The
Hugging Face Hub boasts a plethora of valuable datasets, including the notable databricks/databricks-
dolly-15k.
In employing QLoRA, we will set the rank at 64, coupled with a scaling parameter of 16. Our approach
involves loading the Llama 2 model directly in 4-bit precision, specifically employing the NF4 type, and
then training it over a single epoch. For insights into other associated parameters, you are encouraged to
explore the TrainingArguments, PeftModel, and SFTTrainer documentation.
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"
# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"
# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"
################################################################################
# QLoRA parameters
################################################################################
# LoRA attention dimension
lora_r = 64
# Alpha parameter for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1
################################################################################
# bitsandbytes parameters
################################################################################
# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
################################################################################
# TrainingArguments parameters
################################################################################
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"
# Number of training epochs
num_train_epochs = 1
# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False
# Batch size per GPU for training
per_device_train_batch_size = 4
# Batch size per GPU for evaluation
per_device_eval_batch_size = 4
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1
# Enable gradient checkpointing
gradient_checkpointing = True
# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Optimizer to use
optim = "paged_adamw_32bit"
# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"
# Number of training steps (overrides num_train_epochs)
max_steps = -1
# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03
# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True
# Save checkpoint every X updates steps
save_steps = 25
# Log every X updates steps
logging_steps = 25
################################################################################
# SFT parameters
################################################################################
# Maximum sequence length to use
max_seq_length = None
# Pack multiple short examples in the same input sequence to increase efficiency
packing = False
# Load the entire model on GPU 0
device_map = {"": 0}
Let’s commence the fine-tuning process, integrating various components for this task.
Initially, we will source the previously defined dataset. It’s pertinent to note that our dataset is already
refined; however, under typical circumstances, this step would entail reshaping prompts, filtering out
inconsistent text, amalgamating multiple datasets, and so forth.
Subsequently, we will set up bitsandbytes to facilitate 4-bit quantization.
Following this, we will instantiate the Llama 2 model in 4-bit precision on a GPU, aligning it with the
appropriate tokenizer.
To conclude our preparations, we will initialize the configurations for QLoRA, outline the standard training
parameters, and forward all these settings to the SFTTrainer. With everything in place, the training
journey begins!
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)
model.config.use_cache = False
model.config.pretraining_tp = 1
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token
tokenizer.padding_side = "right"
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
# Train model
trainer.train()
# Save trained model
trainer.model.save_pretrained(new_model)
The duration of the training process can vary significantly based on your dataset’s size. In this instance, it
was completed in under an hour using a T4 GPU. To review the progress visually, refer to the
plots available in TensorBoard.
%load_ext tensorboard
%tensorboard --logdir results/runs
Let’s confirm that the model is operating as expected. While a comprehensive evaluation would be ideal,
we can utilize the text generation pipeline for preliminary assessments by posing questions such as,
“What is a large language model?” It’s crucial to adjust the input to align with Llama 2’s prompt structure.
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)
# Run text generation pipeline with our next model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
The model provides the subsequent answer:
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1270: UserWarning: You have
modified the pretrained model configuration to control generation. This is a deprecated strategy to control
generation and will be removed soon, in a future version. Please use a generation configuration file (see
https://huggingface.co/docs/transformers/main_classes/text_generation )
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs
have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[INST] What is a large language model? [/INST] A large language model is a type of artificial intelligence
(AI) model that is trained on a large dataset of text to generate human-like language outputs. It is
designed to be able to understand and generate text in a way that is similar to human language, and can
be used for a variety of applications such as chatbots, language translation, and text summarization.
Large language models are typically trained using deep learning techniques, such as recurrent neural
networks (RNNs) or transformer models, and are often based on pre-trained models such as BERT or
RoBERTa. These models are trained on large datasets of text, such as books, articles, or websites, and
are designed to learn the patterns and structures of language.
Some examples of large language models include:
* BERT (Bidirectional Encoder Representations from Transformers):
Drawing from our observations, the coherence demonstrated by a model encompassing merely 7 billion
parameters is quite impressive. Feel free to experiment further by posing more complex questions,
perhaps drawing from datasets like BigBench-Hard. Historically, the Guanaco dataset has been pivotal in
crafting top-tier models. To achieve this, consider training a Llama 2 model utilizing the
mlabonne/guanaco-llama2 dataset.
So, how do we save our refined llama-2-7b-miniguanaco model? The key lies in merging the LoRA
weights with the foundational model. Presently, there is no direct, seamless method to achieve this: the
procedure involves reloading the base model in FP16 precision and using the peft library for the
merge. Regrettably, this approach can run into VRAM issues, even after the GPU memory has been
cleared. It may help to restart the notebook, re-run the first three cells, and then proceed to the
subsequent one.
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()
# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Having successfully combined our weights and reinstated the tokenizer, we are positioned to upload the
entirety to the Hugging Face Hub, ensuring our model’s preservation.
!huggingface-cli login
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)
This model is now ready for inference and can be accessed and loaded from the Hub just as you would
with any other Llama 2 model.
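For instance, here is a brief sketch of reloading it for inference; the repo id below is a placeholder for wherever you pushed the model:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

repo_id = "your-username/llama-2-7b-miniguanaco"  # placeholder: your Hub repo

model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=200)
print(pipe("[INST] What is a large language model? [/INST]")[0]["generated_text"])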
Challenges in fine-tuning Llama 2
Navigating the fine-tuning process
Fine-tuning LLMs like Llama 2 presents its unique set of complexities, differing from standard text-to-text
model adaptations. The process remains intricate for enterprise applications, even with supportive
libraries like Hugging Face’s transformers and trl. Key challenges include:
Absence of a standard interface to set prompt and task descriptors and to adjust datasets in
alignment with these parameters.
The multitude of training parameters that necessitate manual configuration tailored to specific
datasets.
The onus of establishing, managing, and scaling a robust infrastructure for distributed fine-tuning
falls on the practitioner. Achieving optimal performance with a model of around 7B parameters
becomes challenging, especially given GPU memory constraints, and deploying distributed
training effectively demands a deep-rooted understanding of the subject.
Securing computational assets
LLMs, by nature, are voracious consumers of computational resources. Their memory, power, and time
demands are lofty, constraining entities lacking these resources. This disparity can act as a barrier to
universalizing the fine-tuning process.
Streamlining distributed model training
The sheer size of LLMs like Llama 2 makes it impractical to house them on a singular GPU, barring a few
like the A100s. This necessitates a shift from standard parallel training to either model parallel or pipeline
parallel training, whereby model weights are distributed across multiple GPU instances. Open-source
tools such as DeepSpeed facilitate this, but mastering their vast array of configurable parameters can be
daunting. Incorrect configurations can lead to memory overflow on CPUs/GPUs or suboptimal GPU
usage due to unwarranted offloading, elevating training costs.
How does LeewayHertz help in building Llama 2 model-powered solutions?
LeewayHertz, a seasoned AI development company, offers expert solutions in fine-tuning the Llama 2
model to build custom solutions aligned with specific organizational needs and objectives. Here is how we
can help you:
Strategic consulting
Our consulting process begins by deeply understanding your organization’s goals, challenges, and
competitive landscape. We then recommend the most appropriate Llama 2 model-powered solution
tailored to your specific needs. Finally, we develop a comprehensive implementation strategy, ensuring
the solution aligns perfectly with your objectives and positions your organization for success in the rapidly
evolving tech landscape.
Data engineering for Llama 2
With precise data engineering, we transform your organization’s valuable data into a powerful asset for
the development of highly effective Llama 2 model-powered solutions. Our skilled developers carefully
prepare your proprietary data, making sure it meets the necessary standards for fine-tuning the Llama 2
model, thus optimizing its performance to the fullest potential.
Fine-tuning expertise in Llama 2
We fine-tune the Llama 2 model with your proprietary data for domain-specific performance and build a
customized solution around it. This approach ensures the solution delivers accurate and meaningful
responses within your unique context.
Custom Llama 2 solutions
We ensure innovation, efficiency, and a competitive edge with our expertly developed Llama 2 model-
powered solutions. Whether you need chatbots for personalized customer interactions, intelligent content
generators, or context-aware recommendation systems, our Llama 2 model-powered applications are
meticulously crafted to enhance your organization’s capabilities in the dynamic AI landscape.
Seamless integration of Llama 2
We ensure that the Llama 2 model-powered solutions we develop seamlessly align with your existing
processes. Our approach involves analyzing your workflows, identifying key integration points, and
developing a customized integration strategy. This minimizes disruptions while maximizing the benefits of
our solutions, facilitating a smooth transition for your organization into a more efficient, AI-enhanced
operational environment.
Continuous evolution: Upgrades and maintenance
We keep your Llama 2 model-powered application up-to-date and performance-optimized with
our comprehensive upgrade and maintenance services. We diligently monitor emerging trends, security
updates, and advancements in AI technology, ensuring your application stays competitive and secure in
the rapidly evolving tech landscape.
Endnote
This article discusses the intricacies of fine-tuning the Llama 2 7b model leveraging a Colab notebook.
We laid the foundational understanding of LLM training and the intricacies of fine-tuning, shedding light
on the significance of instruction datasets. We effectively adapted the Llama 2 model in our practical
section, ensuring compatibility with its intrinsic prompt templates and tailored parameters.
When incorporated into platforms like LangChain, these refined models emerge as potent alternatives to
offerings like the OpenAI API. It’s imperative to recognize that instruction datasets stand paramount in the
evolving landscape of language models. The efficacy of your model is intrinsically tied to the quality of its
training data. As you embark on this journey, prioritizing high-caliber datasets becomes crucial.
Navigating the complexities of models like Llama 2 may appear challenging, but the rewards are
substantial with diligent application and a clear roadmap. Harnessing the prowess of these advanced
LLMs for targeted tasks can enhance applications, ushering in a new era of linguistic computing.
Don’t let pre-trained models limit your vision. Our extensive development experience and LLM fine-tuning
expertise enable us to build robust custom LLMs tailored to businesses’ specific needs. Contact our AI
experts today and harness the limitless power of LLMs!
Northbay_December_2023_LLM_Reporting.pdfNorthbay_December_2023_LLM_Reporting.pdf
Northbay_December_2023_LLM_Reporting.pdf
 
Unlocking the Power of Generative AI An Executive's Guide.pdf
Unlocking the Power of Generative AI An Executive's Guide.pdfUnlocking the Power of Generative AI An Executive's Guide.pdf
Unlocking the Power of Generative AI An Executive's Guide.pdf
 
Emma Inc - Case Study
Emma Inc - Case StudyEmma Inc - Case Study
Emma Inc - Case Study
 
MLSEV Virtual. ML Platformization and AutoML in the Enterprise
MLSEV Virtual. ML Platformization and AutoML in the EnterpriseMLSEV Virtual. ML Platformization and AutoML in the Enterprise
MLSEV Virtual. ML Platformization and AutoML in the Enterprise
 
Learning Management System Reporting & Analytics Launch
Learning Management System Reporting & Analytics LaunchLearning Management System Reporting & Analytics Launch
Learning Management System Reporting & Analytics Launch
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
 
G11.2013 Application Development Life Cycle Management
G11.2013   Application Development Life Cycle ManagementG11.2013   Application Development Life Cycle Management
G11.2013 Application Development Life Cycle Management
 
Title_ From Concept to Launch_ ML-driven Software Product Development by Our ...
Title_ From Concept to Launch_ ML-driven Software Product Development by Our ...Title_ From Concept to Launch_ ML-driven Software Product Development by Our ...
Title_ From Concept to Launch_ ML-driven Software Product Development by Our ...
 
PMML - Predictive Model Markup Language
PMML - Predictive Model Markup LanguagePMML - Predictive Model Markup Language
PMML - Predictive Model Markup Language
 
CIO Applications Magazine Names Bardess One of the Top 25 ML Solution Providers
CIO Applications Magazine Names Bardess One of the Top 25 ML Solution ProvidersCIO Applications Magazine Names Bardess One of the Top 25 ML Solution Providers
CIO Applications Magazine Names Bardess One of the Top 25 ML Solution Providers
 
The Cloud Is The Corporation Abeyta
The Cloud Is The Corporation AbeytaThe Cloud Is The Corporation Abeyta
The Cloud Is The Corporation Abeyta
 
Decision Transformers Model.pdf
Decision Transformers Model.pdfDecision Transformers Model.pdf
Decision Transformers Model.pdf
 
Decision Transformers Model.pdf
Decision Transformers Model.pdfDecision Transformers Model.pdf
Decision Transformers Model.pdf
 
Emerging engineering issues for building large scale AI systems By Srinivas P...
Emerging engineering issues for building large scale AI systems By Srinivas P...Emerging engineering issues for building large scale AI systems By Srinivas P...
Emerging engineering issues for building large scale AI systems By Srinivas P...
 

More from ChristopherTHyatt

AI STRATEGY CONSULTING: STEERING BUSINESSES TOWARD AI-ENABLED TRANSFORMATION
AI STRATEGY CONSULTING: STEERING BUSINESSES TOWARD AI-ENABLED TRANSFORMATIONAI STRATEGY CONSULTING: STEERING BUSINESSES TOWARD AI-ENABLED TRANSFORMATION
AI STRATEGY CONSULTING: STEERING BUSINESSES TOWARD AI-ENABLED TRANSFORMATIONChristopherTHyatt
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Building Your Own AI Agent System: A Comprehensive Guide
Building Your Own AI Agent System: A Comprehensive GuideBuilding Your Own AI Agent System: A Comprehensive Guide
Building Your Own AI Agent System: A Comprehensive GuideChristopherTHyatt
 
How to build an AI-based anomaly detection system for fraud prevention.pdf
How to build an AI-based anomaly detection system for fraud prevention.pdfHow to build an AI-based anomaly detection system for fraud prevention.pdf
How to build an AI-based anomaly detection system for fraud prevention.pdfChristopherTHyatt
 
The role of AI in invoice processing.pdf
The role of AI in invoice processing.pdfThe role of AI in invoice processing.pdf
The role of AI in invoice processing.pdfChristopherTHyatt
 
How to implement AI in traditional investment.pdf
How to implement AI in traditional investment.pdfHow to implement AI in traditional investment.pdf
How to implement AI in traditional investment.pdfChristopherTHyatt
 
Top Blockchain Technology Companies 2024
Top Blockchain Technology Companies 2024Top Blockchain Technology Companies 2024
Top Blockchain Technology Companies 2024ChristopherTHyatt
 
Transforming data into innovative solutions.pdf
Transforming data into innovative solutions.pdfTransforming data into innovative solutions.pdf
Transforming data into innovative solutions.pdfChristopherTHyatt
 
AI IN PROCUREMENT: REDEFINING EFFICIENCY THROUGH AUTOMATION
AI IN PROCUREMENT: REDEFINING EFFICIENCY THROUGH AUTOMATIONAI IN PROCUREMENT: REDEFINING EFFICIENCY THROUGH AUTOMATION
AI IN PROCUREMENT: REDEFINING EFFICIENCY THROUGH AUTOMATIONChristopherTHyatt
 
Financial fraud detection using machine learning models.pdf
Financial fraud detection using machine learning models.pdfFinancial fraud detection using machine learning models.pdf
Financial fraud detection using machine learning models.pdfChristopherTHyatt
 
AI IN PREDICTIVE ANALYTICS: TRANSFORMING DATA INTO FORESIGHT
AI IN PREDICTIVE ANALYTICS: TRANSFORMING DATA INTO FORESIGHTAI IN PREDICTIVE ANALYTICS: TRANSFORMING DATA INTO FORESIGHT
AI IN PREDICTIVE ANALYTICS: TRANSFORMING DATA INTO FORESIGHTChristopherTHyatt
 
AI IN DECISION MAKING: NAVIGATING THE NEW FRONTIER OF SMART BUSINESS DECISIONS
AI IN DECISION MAKING: NAVIGATING THE NEW FRONTIER OF SMART BUSINESS DECISIONSAI IN DECISION MAKING: NAVIGATING THE NEW FRONTIER OF SMART BUSINESS DECISIONS
AI IN DECISION MAKING: NAVIGATING THE NEW FRONTIER OF SMART BUSINESS DECISIONSChristopherTHyatt
 
AI applications in financial compliance An overview.pdf
AI applications in financial compliance An overview.pdfAI applications in financial compliance An overview.pdf
AI applications in financial compliance An overview.pdfChristopherTHyatt
 
AI FOR LEGAL RESEARCH: STREAMLINING LEGAL PRACTICES FOR THE DIGITAL AGE
AI FOR LEGAL RESEARCH: STREAMLINING LEGAL PRACTICES FOR THE DIGITAL AGEAI FOR LEGAL RESEARCH: STREAMLINING LEGAL PRACTICES FOR THE DIGITAL AGE
AI FOR LEGAL RESEARCH: STREAMLINING LEGAL PRACTICES FOR THE DIGITAL AGEChristopherTHyatt
 
AI in medicine A comprehensive overview.pdf
AI in medicine A comprehensive overview.pdfAI in medicine A comprehensive overview.pdf
AI in medicine A comprehensive overview.pdfChristopherTHyatt
 
Building an AI App: A Comprehensive Guide for Beginners
Building an AI App: A Comprehensive Guide for BeginnersBuilding an AI App: A Comprehensive Guide for Beginners
Building an AI App: A Comprehensive Guide for BeginnersChristopherTHyatt
 
OPTIMIZE TO ACTUALIZE: THE IMPACT OF HYPERPARAMETER TUNING ON AI
OPTIMIZE TO ACTUALIZE: THE IMPACT OF HYPERPARAMETER TUNING ON AIOPTIMIZE TO ACTUALIZE: THE IMPACT OF HYPERPARAMETER TUNING ON AI
OPTIMIZE TO ACTUALIZE: THE IMPACT OF HYPERPARAMETER TUNING ON AIChristopherTHyatt
 
A guide to LTV prediction using machine learning
A guide to LTV prediction using machine learningA guide to LTV prediction using machine learning
A guide to LTV prediction using machine learningChristopherTHyatt
 
AI for cloud computing A strategic guide.pdf
AI for cloud computing A strategic guide.pdfAI for cloud computing A strategic guide.pdf
AI for cloud computing A strategic guide.pdfChristopherTHyatt
 
GENERATIVE AI AUTOMATION: THE KEY TO PRODUCTIVITY, EFFICIENCY AND OPERATIONAL...
GENERATIVE AI AUTOMATION: THE KEY TO PRODUCTIVITY, EFFICIENCY AND OPERATIONAL...GENERATIVE AI AUTOMATION: THE KEY TO PRODUCTIVITY, EFFICIENCY AND OPERATIONAL...
GENERATIVE AI AUTOMATION: THE KEY TO PRODUCTIVITY, EFFICIENCY AND OPERATIONAL...ChristopherTHyatt
 

More from ChristopherTHyatt (20)

AI STRATEGY CONSULTING: STEERING BUSINESSES TOWARD AI-ENABLED TRANSFORMATION
AI STRATEGY CONSULTING: STEERING BUSINESSES TOWARD AI-ENABLED TRANSFORMATIONAI STRATEGY CONSULTING: STEERING BUSINESSES TOWARD AI-ENABLED TRANSFORMATION
AI STRATEGY CONSULTING: STEERING BUSINESSES TOWARD AI-ENABLED TRANSFORMATION
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Building Your Own AI Agent System: A Comprehensive Guide
Building Your Own AI Agent System: A Comprehensive GuideBuilding Your Own AI Agent System: A Comprehensive Guide
Building Your Own AI Agent System: A Comprehensive Guide
 
How to build an AI-based anomaly detection system for fraud prevention.pdf
How to build an AI-based anomaly detection system for fraud prevention.pdfHow to build an AI-based anomaly detection system for fraud prevention.pdf
How to build an AI-based anomaly detection system for fraud prevention.pdf
 
The role of AI in invoice processing.pdf
The role of AI in invoice processing.pdfThe role of AI in invoice processing.pdf
The role of AI in invoice processing.pdf
 
How to implement AI in traditional investment.pdf
How to implement AI in traditional investment.pdfHow to implement AI in traditional investment.pdf
How to implement AI in traditional investment.pdf
 
Top Blockchain Technology Companies 2024
Top Blockchain Technology Companies 2024Top Blockchain Technology Companies 2024
Top Blockchain Technology Companies 2024
 
Transforming data into innovative solutions.pdf
Transforming data into innovative solutions.pdfTransforming data into innovative solutions.pdf
Transforming data into innovative solutions.pdf
 
AI IN PROCUREMENT: REDEFINING EFFICIENCY THROUGH AUTOMATION
AI IN PROCUREMENT: REDEFINING EFFICIENCY THROUGH AUTOMATIONAI IN PROCUREMENT: REDEFINING EFFICIENCY THROUGH AUTOMATION
AI IN PROCUREMENT: REDEFINING EFFICIENCY THROUGH AUTOMATION
 
Financial fraud detection using machine learning models.pdf
Financial fraud detection using machine learning models.pdfFinancial fraud detection using machine learning models.pdf
Financial fraud detection using machine learning models.pdf
 
AI IN PREDICTIVE ANALYTICS: TRANSFORMING DATA INTO FORESIGHT
AI IN PREDICTIVE ANALYTICS: TRANSFORMING DATA INTO FORESIGHTAI IN PREDICTIVE ANALYTICS: TRANSFORMING DATA INTO FORESIGHT
AI IN PREDICTIVE ANALYTICS: TRANSFORMING DATA INTO FORESIGHT
 
AI IN DECISION MAKING: NAVIGATING THE NEW FRONTIER OF SMART BUSINESS DECISIONS
AI IN DECISION MAKING: NAVIGATING THE NEW FRONTIER OF SMART BUSINESS DECISIONSAI IN DECISION MAKING: NAVIGATING THE NEW FRONTIER OF SMART BUSINESS DECISIONS
AI IN DECISION MAKING: NAVIGATING THE NEW FRONTIER OF SMART BUSINESS DECISIONS
 
AI applications in financial compliance An overview.pdf
AI applications in financial compliance An overview.pdfAI applications in financial compliance An overview.pdf
AI applications in financial compliance An overview.pdf
 
AI FOR LEGAL RESEARCH: STREAMLINING LEGAL PRACTICES FOR THE DIGITAL AGE
AI FOR LEGAL RESEARCH: STREAMLINING LEGAL PRACTICES FOR THE DIGITAL AGEAI FOR LEGAL RESEARCH: STREAMLINING LEGAL PRACTICES FOR THE DIGITAL AGE
AI FOR LEGAL RESEARCH: STREAMLINING LEGAL PRACTICES FOR THE DIGITAL AGE
 
AI in medicine A comprehensive overview.pdf
AI in medicine A comprehensive overview.pdfAI in medicine A comprehensive overview.pdf
AI in medicine A comprehensive overview.pdf
 
Building an AI App: A Comprehensive Guide for Beginners
Building an AI App: A Comprehensive Guide for BeginnersBuilding an AI App: A Comprehensive Guide for Beginners
Building an AI App: A Comprehensive Guide for Beginners
 
OPTIMIZE TO ACTUALIZE: THE IMPACT OF HYPERPARAMETER TUNING ON AI
OPTIMIZE TO ACTUALIZE: THE IMPACT OF HYPERPARAMETER TUNING ON AIOPTIMIZE TO ACTUALIZE: THE IMPACT OF HYPERPARAMETER TUNING ON AI
OPTIMIZE TO ACTUALIZE: THE IMPACT OF HYPERPARAMETER TUNING ON AI
 
A guide to LTV prediction using machine learning
A guide to LTV prediction using machine learningA guide to LTV prediction using machine learning
A guide to LTV prediction using machine learning
 
AI for cloud computing A strategic guide.pdf
AI for cloud computing A strategic guide.pdfAI for cloud computing A strategic guide.pdf
AI for cloud computing A strategic guide.pdf
 
GENERATIVE AI AUTOMATION: THE KEY TO PRODUCTIVITY, EFFICIENCY AND OPERATIONAL...
GENERATIVE AI AUTOMATION: THE KEY TO PRODUCTIVITY, EFFICIENCY AND OPERATIONAL...GENERATIVE AI AUTOMATION: THE KEY TO PRODUCTIVITY, EFFICIENCY AND OPERATIONAL...
GENERATIVE AI AUTOMATION: THE KEY TO PRODUCTIVITY, EFFICIENCY AND OPERATIONAL...
 

Recently uploaded

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

FINE-TUNING LLAMA 2: DOMAIN ADAPTATION OF A PRE-TRAINED MODEL

  • 1. 1/27 Fine-tuning Llama 2: An overview leewayhertz.com/fine-tuning-llama2/ In the dynamic and ever-evolving field of generative AI, a profound sense of competition has taken root, fueled by a relentless quest for innovation and excellence. The introduction of GPT by OpenAI has prompted various businesses to work on creating their own Large Language Models (LLMs). However, creating such sophisticated algorithms is like navigating through a maze of complexities. It demands exhaustive research, a massive amount of relevant data and overcoming numerous other challenges. Further, the substantial computational power required for these tasks remains a significant hurdle for many. Amidst this fiercely competitive landscape, where industry heavyweights like OpenAI and Google have already etched their indelible marks, a new contender, Meta, entered the arena with their open-source LLM, Llama, with a goal of democratizing AI. They subsequently upgraded it to Llama 2, which was trained on 40% more data than its predecessor. While all large language models exhibit remarkable efficiency, their adaptability to handle domain-specific inquiries, such as those related to a business’s financial performance or inventory status, may be constrained. To empower these models with domain- specific competence and elevate their precision, a refinement process called fine-tuning is implemented. In this article, we will talk about fine-tuning Llama 2, a model that has opened up new avenues for innovation, research, and commercial applications. This process of fine-tuning may be considered imperative as it can yield numerous benefits like cost savings, secure management of confidential data, and the potential to surpass renowned models like GPT-4 in specialized tasks.
Architecturally, Llama 2 distinguishes itself from its peers through several innovative attributes. It leverages RMSNorm normalization, the SwiGLU activation function, and rotary positional embeddings to further enhance its data processing prowess (see the RMSNorm sketch below). The use of the AdamW optimizer with a cosine learning rate schedule, a weight decay of 0.1, and gradient clipping underscores Meta's commitment to refining even the most nuanced aspects of model development.

Yet, the true innovation of Llama 2 lies not merely in its architectural and training advancements but in its fine-tuning strategies. Meta has judiciously prioritized quality over quantity in its Supervised Fine-Tuning (SFT) phase, a decision informed by numerous studies indicating that high-quality data yields superior model performance. Complementing this is the Reinforcement Learning from Human Feedback (RLHF) stage, designed to align the model with user preferences. Using a comparative approach in which annotators evaluate pairs of model outputs, the RLHF process refines Llama 2 to accentuate helpfulness and safety in its responses.
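To ground one of the components named above, here is a minimal RMSNorm sketch in PyTorch. It illustrates the general technique rather than Meta's exact implementation: activations are rescaled by their root mean square instead of being mean-centered as in LayerNorm.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-dimension gain
        self.eps = eps

    def forward(self, x):
        # Scale by the root mean square of the activations (no mean subtraction)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms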
Furthermore, Llama 2's commercial adaptability is evident in its open-source and commercial character, facilitating ease of use and expansion. It is not merely a static tool; it is a dynamic solution optimized for dialogue use cases, as seen in the Llama-2-chat versions available on the Hugging Face platform. While the models differ in parameter size, their consistent optimization for both speed and accuracy underscores their adaptability to diverse operational demands.

Overall, Llama 2, as a member of the Llama family of LLMs, not only matches the technical prowess of contemporaries like GPT-3 and PaLM 2 but also introduces several groundbreaking innovations. Its optimized transformer architecture, rigorous training, fine-tuning procedures, and open-source accessibility position it as a formidable asset in the AI landscape, promising a future of more accurate, efficient, and user-aligned AI solutions.

Why use Llama 2?

In today's AI-driven landscape, responsibility and accountability take center stage. Meta's Llama 2 is evidence of this heightened focus on creating AI solutions that are transparent, accountable, and open to scrutiny. This section delves into why Llama 2's approach is pivotal in reshaping our understanding and expectations of AI models.

Open source: The bedrock of transparency

Most LLMs, such as OpenAI's GPT-3 and GPT-4, Google's PaLM and PaLM 2, and Anthropic's Claude, have predominantly been closed source. This limited accessibility restricts the broader research community from fully understanding these models' intricacies and decision-making processes. Llama 2 stands in stark contrast. Being open source enables anyone with relevant technical expertise not just to access but also to dissect, understand, and potentially modify the model. By enabling people to peruse the research paper detailing Llama 2's development and training and even download the model for personal or business use, Meta is championing an era of transparency in AI.

Ensuring safety through red-teaming

Safety in AI is paramount, and Llama 2's development process reflects this priority. Internal teams and third-party commissions generated adversarial prompts through intensive red-teaming exercises to inform model fine-tuning. These rigorous processes are not a one-time effort; they signify Meta's ongoing commitment to refining model safety iteratively. The intention is clear: ensuring Llama 2 is robust against unforeseen challenges.

Transparent reporting: An insight into model evaluation

The research paper reflects Meta's commitment to transparency, detailing the challenges encountered during the development of Llama 2. By highlighting known issues, outlining the steps taken to mitigate them, and noting those planned for future iterations, Meta provides an open playbook on the model's strengths and areas for improvement.

Empowering developers: "Responsible Use Guide" and "Acceptable Use Policy"
With great power comes great responsibility. Acknowledging LLMs' vast potential and inherent risks, Meta has devised a "Responsible Use Guide" to steer developers towards best practices in AI development and safety evaluations. Complementing this is an "Acceptable Use Policy," which defines boundaries for ensuring the responsible use of the model.

Engaging the global community

Meta recognizes the collective intelligence of the global community. Initiatives such as the Open Innovation AI Research Community invite academic researchers to share insights and research on the responsible development of LLMs. Furthermore, the Llama Impact Challenge is a call to action for public, non-profit, and for-profit entities to harness Llama 2 in addressing critical global challenges like environmental conservation and education.

Launch your project with LeewayHertz

We specialize in fine-tuning pre-trained LLMs to ensure they offer domain-specific responses tailored to your unique business requirements. For the specifics you're looking for, contact us today!

Why does Llama 2 matter in the AI landscape?

The global AI community has long awaited a shift from commercial monopolization towards open-source research and experimentation. Meta's Llama 2 heralds this change. By offering an open-source AI, Meta ensures a credible alternative to closed-source AI. It democratizes AI, allowing other companies to develop AI-powered applications under their control, bypassing the commercial constraints of tech giants like Apple, Google, and Amazon.

Llama 2 is not just a technological marvel; it is a statement on the importance of responsibility, transparency, and collaboration in AI. It embodies a future where AI development prioritizes societal benefits, open dialogue, and ethical considerations.

How does Llama 2 work?

Llama 2, a state-of-the-art language model, has been built using sophisticated training techniques to understand and generate human-like text. To comprehend its operations, one must delve into its data sources, training methodologies, and potential applications.

Data sources and neural network training

Llama 2's foundational strength is attributed to its extensive training on a staggering 2 trillion tokens. These tokens were sourced from publicly accessible repositories, including:

Common Crawl: An expansive archive encompassing billions of web pages.
Wikipedia: The free encyclopedia offering a wealth of knowledge on myriad topics.
Project Gutenberg: A treasure trove of public domain books.
Each token, be it a word or a semantic fragment, helps Llama 2 discern the meaning behind text. For instance, if the model consistently encounters "Apple" and "iPhone" together, it infers the inherent relationship between these terms, distinguishing it from other related pairings such as "apple" and "fruit."

Ensuring quality and mitigating bias

Given the vastness and diversity of the internet, training a model solely on such data can inadvertently introduce biases or produce inappropriate content. Acknowledging this, the developers of Llama 2 incorporated additional training mechanisms:

Reinforcement Learning from Human Feedback (RLHF): This technique involves human testers who evaluate multiple AI-generated responses. Their feedback is instrumental in guiding the model towards generating more relevant and appropriate content.

Adaptation for conversational context

Llama 2's chat versions were meticulously fine-tuned using specific datasets to enhance conversational prowess. This ensures that when engaged in a dialogue, Llama 2 responds naturally, simulating human interaction.

Customization and fine-tuning

One of Llama 2's defining features is its adaptability. Organizations can mold it to resonate with their unique brand voice. For instance, if a firm wishes to produce summaries reflecting its distinct style, Llama 2 can be trained on numerous examples to achieve this. Similarly, the model can be fine-tuned for customer support using FAQs and chat logs, allowing it to respond precisely to user queries.

Llama 2's robustness and adaptability are products of its comprehensive training and fine-tuning methodologies. Its ability to assimilate vast data, combined with human feedback mechanisms and customization options, positions it at the forefront of the language model domain.

A thorough analysis of Llama 2 in comparison to other leading LLMs

The advancement of AI, especially in the domain of large language models, has been nothing short of extraordinary. This is prominently demonstrated by Llama 2, an LLM designed with adaptability in mind to empower developers and researchers to explore new horizons and create innovative applications. Here, we explore the outcomes of some experiments carried out to evaluate how Llama 2 compares to giants like OpenAI's GPT and Google's PaLM.

Creative aptitude: Llama 2 was prompted to simulate a sarcasm-laden dialogue on space exploration; the resultant discourse, although impressive, trailed slightly behind ChatGPT's. When compared with Google's Bard, however, Llama 2 showcased superior flair. Thus, while ChatGPT remains the frontrunner in creative engagements, Llama 2 holds a commendable position amongst its peers.
Programming capabilities: Llama 2 was pitted against ChatGPT and Bard in a coding challenge: developing functional applications ranging from a basic to-do list to a Tetris game. Although ChatGPT mastered every challenge, Llama 2, like Bard, efficiently crafted the to-do list and an authentication system, stumbling only on the Tetris game.

Mathematical proficiency: Llama 2's prowess in solving algebraic and logical math problems was noteworthy, particularly when compared to Bard. However, ChatGPT's mathematical proficiency remained unmatched. Remarkably, Llama 2 excelled at certain problems on which its predecessors, in their early stages, had faltered.

Reasoning and commonsense: A facet that remains a challenge for many AI models is commonsense reasoning. ChatGPT unsurprisingly led the pack, and the contest for second place was neck and neck between Bard and Llama 2, with Bard slightly edging ahead.

Llama 2, though an impressive foundational model, still has room for growth compared to certain specialized, fine-tuned models on the market. Foundational models like Llama 2 are designed with versatility and future adaptability at their core, unlike fine-tuned models optimized for domain-specific expertise. Given its nascent stage and its 'foundational' nature, the potential avenues for Llama 2's evolution are promising.

What does fine-tuning an LLM mean?

When discussing the fine-tuning of LLMs, it is worth recognizing that the practice extends beyond language models: fine-tuning can be applied across many kinds of machine learning models, depending on the use case.
Machine learning models are trained to identify patterns within given datasets. For instance, a Convolutional Neural Network (CNN) designed to detect cars in urban areas would be highly proficient in that domain due to training on relevant images. Yet, when faced with detecting trucks on highways, its efficacy might decrease due to unfamiliarity with that data distribution. Rather than starting from scratch with a new training dataset, fine-tuning allows adjustments to be made to the existing model so it can accommodate the new data.

Several advanced LLMs are available, including GPT-3, BLOOM, BERT, T5, and XLNet. GPT-3, for instance, is a premium model recognized for its vast scale of 175 billion parameters, making it adept at a wide range of natural language processing tasks. BERT, conversely, is a more accessible open-source model that excels at understanding contextual word relationships. The choice between models like GPT-3 and BERT largely depends on the specific task at hand, be it text generation or text classification.

Techniques for LLM fine-tuning

The process of fine-tuning LLMs is intricate, with different techniques suited to different applications. Sometimes, the goal is to adapt a model to a novel task. Imagine having a pre-trained LLM skilled in text generation that you want to perform sentiment analysis. This entails remodeling the model with subtle architectural tweaks before diving into the fine-tuning phase.

In such a context, you will primarily harness the numeric vectors, called embeddings, generated by the LLM's transformer component. These embeddings carry detailed features of the given input. Certain LLMs directly produce these embeddings, whereas others, such as the GPT series, use them for token or text generation. During adaptation, the LLM's embedding layer gets linked to a classification system, typically a set of fully connected layers translating embeddings into class probabilities (illustrated in the sketch below). The emphasis lies in training the classification segment on the model-driven embeddings. While the LLM's attention layers generally remain unchanged, offering computational efficiency, the classifier requires a supervised learning dataset with text instances and their respective labels. How much fine-tuning data you need depends on task intricacy and classifier specifics.

Yet some occasions demand deeper adjustment, requiring the attention layers to be unlocked for a full-blown fine-tuning project. The cost of this intensive process also depends on the model size. Besides, there exist strategies to streamline the costs of fine-tuning. Let's delve deeper and explore some prominent fine-tuning techniques.
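As a concrete illustration of the embedding-plus-classifier adaptation described above, the following PyTorch sketch freezes a pre-trained encoder and trains only a small classification head on its embeddings. The model name and three-class setup are illustrative assumptions, not part of the original article.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Load any encoder-style pre-trained model; "bert-base-uncased" is illustrative
base = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for p in base.parameters():
    p.requires_grad = False  # the attention layers stay frozen

# Fully connected layers mapping embeddings to class probabilities
classifier = nn.Sequential(
    nn.Linear(base.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 3),  # e.g., negative / neutral / positive sentiment
)

inputs = tokenizer("The product exceeded expectations", return_tensors="pt")
with torch.no_grad():
    embedding = base(**inputs).last_hidden_state[:, 0]  # [CLS] token embedding
logits = classifier(embedding)  # only the classifier receives gradient updates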
Unsupervised versus supervised fine-tuning (SFT)

Sometimes there is a need to refresh the LLM's knowledge base without necessarily changing its behavior. If, for instance, you intend to adapt the model to medical terminology or a new language, an expansive, unstructured dataset suffices. Here, the goal is to immerse the model in a sea of tokens representative of the new domain or anticipated input types, and leveraging vast unstructured datasets scales well thanks to unsupervised or self-supervised methodologies.

However, there are cases where merely updating the model's information reservoir falls short: the LLM's behavior itself needs an overhaul. That necessitates a supervised fine-tuning (SFT) dataset, complete with prompts and expected outcomes. This method is pivotal for models like ChatGPT, which are designed to be highly responsive to user directives.
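The following contrasts the two kinds of training data. It is illustrative only; the field names are a common convention rather than a fixed standard.

# Supervised fine-tuning: each record pairs a prompt with the desired output
sft_example = {
    "prompt": "Summarize the patient's discharge instructions in plain language.",
    "response": "Take the prescribed antibiotic twice a day for seven days and rest.",
}

# Unsupervised adaptation: raw domain text is enough to refresh the knowledge base
unsupervised_example = (
    "Metformin is commonly used as a first-line treatment for type 2 diabetes."
)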
Reinforcement Learning from Human Feedback (RLHF)

To elevate SFT, some practitioners employ reinforcement learning from human feedback, a complex procedure that, at present, only well-resourced organizations have the capacity to run. While RLHF techniques vary, they all emphasize human-guided LLM training: human reviewers assess the model's outputs for given prompts, steering the model toward desired results.

Take OpenAI's ChatGPT as an RLHF benchmark. Human feedback is used to develop a reward model mirroring human preferences, and the LLM then undergoes rigorous reinforcement learning to optimize its outputs against the signals this reward model provides.
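A minimal sketch of the pairwise loss commonly used to train such a reward model, assuming reward_model is any network mapping a prompt/response pair to a scalar score. This is the widely used Bradley-Terry-style formulation, not necessarily OpenAI's exact recipe.

import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    # Score both responses; the human-preferred one should rank higher
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # -log(sigmoid(r_chosen - r_rejected)) shrinks as the preferred margin grows
    return -F.logsigmoid(r_chosen - r_rejected).mean()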
Parameter-efficient Fine-tuning (PEFT)

PEFT, an emerging field within LLM fine-tuning, aims to minimize the resources spent on updating model parameters by limiting how many parameters are altered. One such method gaining traction is Low-rank Adaptation (LoRA). The essence of LoRA is that only a small number of parameters need adjusting for downstream tasks, so a compact matrix can capture the task-specific nuances. Implementing LoRA means training this compact matrix rather than the entirety of the LLM's parameters. Once trained, the LoRA weights can either be merged into the primary LLM or applied separately during inference. Adopting techniques like LoRA can reduce fine-tuning expenditure considerably while enabling the storage of numerous fine-tuned adapters ready for integration during LLM operations.

Reinforcement Learning from AI Feedback (RLAIF)

Fine-tuning an LLM using Reinforcement Learning from AI Feedback (RLAIF) involves a structured process that ensures the model's behavior aligns with a set of predefined principles or guidelines, often encapsulated in a Constitution. Here is an overview of the steps involved:

Define the Constitution

Constitution creation: Begin by defining the Constitution, a document or set of guidelines that outlines the principles, ethics, and behavioral norms the AI model should adhere to. This Constitution will guide the AI feedback model in generating preferences.
Set up the AI feedback model

Model selection: Choose or develop an AI feedback model capable of understanding and applying the principles outlined in the Constitution.

Model training (if necessary): If the AI feedback model isn't pre-trained, you might need to train it to interpret the Constitution and evaluate responses against it. This could involve supervised learning on a dataset where responses are annotated for their alignment with constitutional principles.

Generate feedback data

Feedback generation: Use the AI feedback model to evaluate pairs of prompt/response instances. For each pair, the model assigns a preference score indicating which response aligns better with the principles in the Constitution (see the sketch after these steps).

Train the preference model (PM)

Data preparation: Organize the AI-generated feedback into a dataset suitable for training the preference model (PM).

Preference model training: Train the model on this dataset. It learns to predict the preferred response to a given prompt based on the feedback scores provided by the AI feedback model.

Fine-tune the LLM

Integration with reinforcement learning: Integrate the trained preference model into a reinforcement learning framework. In this setup, the preference model provides the reward signal based on how well a response from the LLM aligns with the constitutional principles.

LLM fine-tuning: Fine-tune the LLM using this reinforcement learning setup. The LLM generates responses to prompts, the PM evaluates them, and the LLM adjusts its parameters to maximize the reward signal, effectively learning to produce responses that better align with the constitutional principles.

Evaluation and iteration

Model evaluation: After fine-tuning, evaluate the LLM's performance to ensure it aligns with the desired principles and handles a variety of prompts effectively.

Feedback loop: If the performance is not satisfactory, or if there is room for improvement, iterate over the process. This could involve refining the Constitution, adjusting the AI feedback model, retraining the preference model, or further fine-tuning the LLM.

Deployment and monitoring

Deployment: Once the fine-tuning process meets the performance and ethical standards, deploy the model.

Continuous monitoring: Regularly monitor the model's performance and behavior to ensure it continues to align with the constitutional principles, adapting to new data and evolving requirements.
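To make the feedback-generation step concrete, here is a hedged sketch of how an AI feedback model might produce one preference record. The feedback_model interface and the prompt wording are illustrative assumptions, not a prescribed API.

def generate_preference(feedback_model, constitution, prompt, response_a, response_b):
    # Ask the feedback model which response better follows the Constitution
    judgement_prompt = (
        f"Principles:\n{constitution}\n\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the principles? Answer A or B."
    )
    verdict = feedback_model(judgement_prompt)  # assumed to return "A" or "B"
    chosen, rejected = (
        (response_a, response_b) if verdict == "A" else (response_b, response_a)
    )
    # One training record for the preference model
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}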
Fine-tuning an LLM using RLAIF is a complex process that involves careful design, consistent evaluation, and ongoing adjustment to ensure that the model's behavior aligns with human values and ethical standards. It is a dynamic process that benefits from continuous monitoring and iterative improvement.

How can we perform fine-tuning on Llama 2?

PEFT approaches – LoRA and QLoRA

Parameter-efficient Fine-tuning (PEFT) presents an effective approach to fine-tuning LLMs. Distinct from traditional methods that mandate extensive parameter updates, PEFT focuses on refining a select subset of parameters, minimizing computational demands and expediting the training process. By gauging the significance of individual parameters based on their influence on the overall model, PEFT prioritizes those with maximal impact. Consequently, only these pivotal parameters are adjusted during the fine-tuning phase, while the others remain static. Such a strategy curtails computational and temporal overheads and paves the way for swift model iteration and deployment. As PEFT emerges as a frontrunner among optimization techniques, it is vital to recognize that it remains a dynamic field, with continuous research ushering in nuanced variations and enhancements. The choice of PEFT technique will invariably depend on specific research goals and practical contexts.

PEFT effectively reduces RAM and storage demands by refining only a select set of parameters while keeping the majority in their original state. Its strength lies in fostering robust generalization even when datasets are limited in volume, and it augments the model's reusability and transferability: small model checkpoints derived from PEFT integrate seamlessly with the foundational model, enabling versatile fine-tuning across diverse scenarios simply by swapping in the PEFT-specific parameters. A salient feature is the preservation of insights from the pre-training phase, keeping the model resilient to catastrophic forgetting.

Prominent PEFT strategies preserve the integrity of the pre-trained base, introducing supplementary layers or parameters termed "adapters." Through a process dubbed "adapter-tuning," these layers are integrated with the foundational model, with tuning efforts concentrated on the new layers alone. A notable challenge with this approach is heightened latency during inference, which can hamper efficiency in various contexts.

Parameter-efficient fine-tuning has become a pivotal area of focus within AI, and there are myriad techniques to achieve it. Among these, Low-rank Adaptation (LoRA) and its enhanced counterpart, QLoRA, are distinguished for their effectiveness.

Low-rank Adaptation (LoRA)
LoRA introduces an innovative paradigm in model fine-tuning, offering a modular method adept at domain-specific tasks and at transferring learned capabilities. The intrinsic appeal of LoRA lies in its ability to be executed with minimal resources while being memory-conservative. A closer examination of the technique reveals the following steps:

Pre-trained parameter preservation: The original neural network's foundational parameters (W) remain unaltered during the adaptation process.

Inclusion of new parameters: Alongside this original setup, supplementary networks (denoted WA and WB) are embedded. These networks use low-rank vectors: their dimensionalities (d×r and r×d) are purposefully small compared to the original network's, where 'd' is the original vector's dimension and 'r' the low rank. Notably, a smaller 'r' accelerates training, although it may require a careful balance to maintain performance.

Dot product calculation: The original and low-rank networks are combined through a dot product, generating an n-dimensional weight matrix that informs the model's results.

Loss function computation: The loss is computed by contrasting the derived results against expected outputs, and traditional backpropagation is then used to update the WA and WB weights.

LoRA's essence is its economical memory footprint and modest infrastructure demands. For instance, given a 512×512 parameter matrix in a typical feed-forward network (262,144 parameters), a LoRA adapter of rank 2 trains only 2,048 parameters (512×2 for each of WA and WB) on domain-specific data. This streamlined process significantly elevates computational efficiency; the sketch below makes the arithmetic concrete.
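A minimal PyTorch sketch of this scheme, assuming a square 512×512 layer and rank 2 as in the example above; it illustrates the technique rather than reproducing any particular library's implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, r: int, alpha: float = 16.0):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)             # pre-trained weights
        self.W.weight.requires_grad = False              # kept frozen
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # WA: d x r
        self.B = nn.Parameter(torch.zeros(r, d))         # WB: r x d, zero-initialized
        self.scale = alpha / r                           # LoRA scaling factor

    def forward(self, x):
        # Frozen path plus the low-rank update (x @ A @ B)
        return self.W(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(d=512, r=2)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2,048 trainable parameters versus 262,144 frozen ones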
An exceptional facet of LoRA is its modular design. The trained adapter can be retained as an independent entity, serving as a modular component for a specific domain. Furthermore, LoRA adeptly sidesteps catastrophic forgetting by abstaining from modifying the foundational weights.

Further developments: QLoRA

To further accentuate the effectiveness of LoRA, QLoRA has been introduced as an augmented technique promising enhanced optimization and performance. QLoRA builds upon LoRA to further improve efficiency by converting the weight values of the original network from high-precision formats, like Float32, to more compact types, such as int4. This conversion reduces memory usage and accelerates computation. QLoRA introduces three primary enhancements over LoRA, establishing it as a leading method in PEFT.

1. 4-bit NF4 quantization

Using 4-bit NormalFloat (NF4) is a strategic move to decrease storage requirements. The process divides into three phases:

Normalization and quantization: Weights are shifted to a zero mean and consistent unit variance. Given that a 4-bit format can hold just 16 distinct values, each weight is mapped to the closest of these 16 levels based on its relative position. For example, an FP32 weight of 0.2121 is stored as its nearest 4-bit equivalent, not the exact value.

Dequantization: The reverse process. After training, the quantized weights are restored to their near-original form.

Double quantization: This phase pushes memory optimization further: the quantization constants themselves are grouped and quantized to 8 bits, yielding a further significant reduction in memory usage. In essence, for a model with 1 million parameters, the memory occupied by the quantization constants can be slashed to around 125,000 bits.
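A toy NumPy illustration of the quantize/dequantize round trip just described. It uses a uniform 16-level grid for simplicity; real NF4 spaces its 16 levels according to a normal distribution, so this is a sketch of the idea rather than the actual format.

import numpy as np

weights = np.array([0.2121, -0.73, 0.05, 0.9], dtype=np.float32)
absmax = np.abs(weights).max()        # per-block scaling constant
grid = np.linspace(-1.0, 1.0, 16)     # the 16 representable 4-bit values

# Quantize: map each normalized weight to the index of its nearest grid level
indices = np.abs(weights[:, None] / absmax - grid[None, :]).argmin(axis=1)

# Dequantize: restore approximate weights from the stored 4-bit indices
restored = grid[indices] * absmax
print(weights)   # original values
print(restored)  # close to, but not exactly, the originals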
2. Unified memory paging

Together with its quantization methods, QLoRA leverages NVIDIA's unified memory capability, which facilitates smooth page transfers between GPU and CPU memory. This is particularly useful during memory-intensive operations or unexpected GPU demand spikes, ensuring no memory overflow.

While both LoRA and QLoRA are at the forefront of PEFT, QLoRA's additional techniques offer superior memory efficiency and optimization.

Fine-tuning the Llama 2 model with QLoRA

Let's delve into the process of fine-tuning the Llama 2 model with 7 billion parameters. We will harness the computational power of a T4 GPU with high RAM, available on Google Colab at a rate of 2.21 credits per hour. The T4 comes equipped with 16 GB of VRAM. The weights of Llama 2-7B alone (7 billion parameters at 2 bytes each in FP16) occupy 14 GB, stretching the VRAM almost to its limit before even accounting for overheads such as optimizer states, gradients, and forward activations. The implication is clear: traditional full fine-tuning won't work here; we need parameter-efficient fine-tuning techniques such as LoRA or QLoRA.

One way to significantly cut down on VRAM usage is to fine-tune the model in 4-bit precision, which makes QLoRA an apt choice. Fortunately, the Hugging Face ecosystem provides the transformers, accelerate, peft, trl, and bitsandbytes libraries to facilitate this. Our step-by-step code is inspired by the contributions of Younes Belkada on GitHub. We initiate the process by installing and importing these libraries.

!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
Fortunately, the Hugging Face ecosystem is equipped with libraries such as transformers, accelerate, peft, trl, and bitsandbytes to facilitate this. Our step-by-step code is inspired by the contributions of Younes Belkada on GitHub. We initiate the process by installing and importing these libraries.

!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

Let's walk through the adjustable parameters. We will load the llama-2-7b-chat-hf model, commonly referred to as the chat model, and train it on the mlabonne/guanaco-llama2-1k dataset, which comprises 1,000 samples. The resulting fine-tuned model will be named llama-2-7b-miniguanaco. For those curious about the origin and creation of this dataset, a detailed notebook is available for review. Customization is, of course, possible: the Hugging Face Hub hosts a plethora of valuable datasets, including the notable databricks/databricks-dolly-15k. For QLoRA, we will set the rank to 64 with a scaling parameter of 16, load the Llama 2 model directly in 4-bit precision using the NF4 type, and train it over a single epoch. For the other associated parameters, you are encouraged to explore the TrainingArguments, PeftModel, and SFTTrainer documentation.
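Before defining all of those parameters, it can help to quickly inspect the dataset we just named. This is an optional check; the comments below assume, as the trainer configuration later confirms via dataset_text_field="text", that each sample is stored in a single "text" column already formatted with Llama 2's [INST] template.

# Optional: peek at the instruction dataset before training.
from datasets import load_dataset

ds = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
print(ds)                   # expected: 1,000 rows with a single "text" column
print(ds[0]["text"][:250])  # the prompt template is baked into each sample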
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant is a bit better than cosine here)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches of the same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################
# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}

Let's commence the fine-tuning process, integrating the various components for this task. Initially, we will load the previously defined dataset. Our dataset is already refined; under typical circumstances, this step would entail reshaping prompts, filtering out inconsistent text, amalgamating multiple datasets, and so forth. Subsequently, we will set up bitsandbytes for 4-bit quantization, instantiate the Llama 2 model in 4-bit precision on the GPU along with the matching tokenizer, and finally initialize the QLoRA configuration and the standard training parameters, forwarding all of these settings to the SFTTrainer. With everything in place, the training journey begins!

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token   # reuse EOS as the padding token
tokenizer.padding_side = "right"            # avoids an fp16 overflow issue during training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

The duration of training varies significantly with the dataset's size; in this instance, it completed in under an hour on a T4 GPU. To review progress visually, refer to the plots on TensorBoard:
%load_ext tensorboard
%tensorboard --logdir results/runs

Let's confirm that the model is operating as expected. While a comprehensive evaluation would be ideal, we can use the text generation pipeline for a preliminary check by posing a question such as, "What is a large language model?" Note that the input must be formatted to match Llama 2's prompt structure.

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

The model provides the following answer, preceded by two benign UserWarnings from transformers and torch:

/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None

[INST] What is a large language model? [/INST] A large language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text to generate human-like language outputs. It is designed to be able to understand and generate text in a way that is similar to human language, and can be used for a variety of applications such as chatbots, language translation, and text summarization. Large language models are typically trained using deep learning techniques, such as recurrent neural networks (RNNs) or transformer models, and are often based on pre-trained models such as BERT or RoBERTa. These models are trained on large datasets of text, such as books, articles, or websites, and are designed to learn the patterns and structures of language. Some examples of large language models include: * BERT (Bidirectional Encoder Representations from Transformers):
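As an aside on prompt structure: the [INST] ... [/INST] wrapper used above is the minimal single-turn form of Llama 2's chat template. When a system prompt is needed, the template adds <<SYS>> tags. The helper below is our own illustrative sketch, not part of the original notebook:

def build_llama2_prompt(user_msg: str, system_msg: str = "") -> str:
    """Wrap a user message in Llama 2's single-turn chat template."""
    if system_msg:
        return f"[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_msg} [/INST]"
    return f"[INST] {user_msg} [/INST]"

# Example usage with the pipeline defined above:
# result = pipe(build_llama2_prompt("What is a large language model?",
#                                   "Answer in two sentences."))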
Drawing from these observations, the coherence demonstrated by a model of merely 7 billion parameters is quite impressive. Feel free to experiment further with more complex questions, perhaps drawing from datasets like BigBench-Hard. Historically, the Guanaco dataset has been pivotal in crafting top-tier models; to build a stronger version of our model, consider training Llama 2 on the complete mlabonne/guanaco-llama2 dataset.

So, how do we save our refined llama-2-7b-miniguanaco model? The key lies in merging the LoRA weights into the base model. There is currently no seamless, one-step method for this: we must reload the base model in FP16 precision and use the peft library to merge the adapter. Unfortunately, this approach can run into VRAM issues even after clearing memory, so it may be necessary to restart the notebook, re-run the first three cells, and then continue with the following one.

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Having merged the weights and reloaded the tokenizer, we can push everything to the Hugging Face Hub to preserve the model:

!huggingface-cli login

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

This model is now ready for inference and can be loaded from the Hub just as you would load any other Llama 2 model.
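For instance, loading the merged model back for inference could look like the sketch below; "your-username" is a placeholder for the Hub account the model was pushed to.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# "your-username" is a placeholder; substitute your own Hub namespace.
repo_id = "your-username/llama-2-7b-miniguanaco"

model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=200)
print(pipe("[INST] What is a large language model? [/INST]")[0]["generated_text"])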
Challenges in fine-tuning Llama 2

Navigating the fine-tuning process

Fine-tuning LLMs like Llama 2 presents its own set of complexities, differing from standard text-to-text model adaptations. Even with supportive libraries like Hugging Face's transformers and trl, the process remains intricate for enterprise applications. Key challenges include:

The absence of a standard interface for setting prompt and task descriptors and for adjusting datasets in alignment with those parameters.
The multitude of training parameters that require manual configuration tailored to specific datasets.
The onus of establishing, managing, and scaling a robust infrastructure for distributed fine-tuning.
The difficulty of achieving optimal performance with a model of around 7B parameters under GPU memory constraints.
The deep-rooted understanding of the subject that effective distributed training demands.

Securing computational assets

LLMs are, by nature, voracious consumers of computational resources. Their memory, power, and time demands are steep, constraining entities that lack these resources. This disparity can act as a barrier to democratizing the fine-tuning process.

Streamlining distributed model training

The sheer size of LLMs like Llama 2 makes it impractical to fit them on a single GPU, barring a few such as the A100. This necessitates a shift from standard data-parallel training to model-parallel or pipeline-parallel training, whereby model weights are distributed across multiple GPU instances. Open-source tools such as DeepSpeed facilitate this, but mastering its vast array of configurable parameters can be daunting: incorrect configurations can cause memory overflow on CPUs/GPUs or suboptimal GPU utilization due to unwarranted offloading, raising training costs. A minimal configuration sketch follows below.
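As an illustration of that configuration surface, here is a minimal sketch of a DeepSpeed ZeRO-3 setup with CPU offloading, passed to transformers via the TrainingArguments deepspeed parameter. Every value is a starting point to tune per cluster, not a recommendation, and running it requires the deepspeed package plus a multi-process launcher such as deepspeed or accelerate launch.

from transformers import TrainingArguments

# Hypothetical ZeRO-3 configuration with CPU offloading; tune per cluster.
ds_config = {
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, optimizer states
        "offload_optimizer": {"device": "cpu"},  # trades GPU memory for PCIe traffic
        "offload_param": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
    "gradient_accumulation_steps": "auto",       # "auto" lets transformers fill these in
    "train_micro_batch_size_per_gpu": "auto",
}

training_arguments = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    deepspeed=ds_config,   # transformers forwards this dict to DeepSpeed at init
)

Misjudging the offload settings here is exactly the failure mode described above: offloading too much starves the GPUs, while offloading too little overflows memory.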
How does LeewayHertz help in building Llama 2 model-powered solutions?

LeewayHertz, a seasoned AI development company, offers expert solutions in fine-tuning the Llama 2 model to build custom solutions aligned with specific organizational needs and objectives. Here is how we can help you:

Strategic consulting

Our consulting process begins with a deep understanding of your organization's goals, challenges, and competitive landscape. We then recommend the most appropriate Llama 2 model-powered solution tailored to your specific needs. Finally, we develop a comprehensive implementation strategy, ensuring the solution aligns with your objectives and positions your organization for success in the rapidly evolving tech landscape.

Data engineering for Llama 2

With precise data engineering, we transform your organization's valuable data into a powerful asset for developing highly effective Llama 2 model-powered solutions. Our skilled developers carefully prepare your proprietary data, ensuring it meets the standards required for fine-tuning the Llama 2 model and optimizing its performance to the fullest.

Fine-tuning expertise in Llama 2

We fine-tune the Llama 2 model with your proprietary data for domain-specific performance and build a customized solution around it. This approach ensures the solution delivers accurate and meaningful responses within your unique context.

Custom Llama 2 solutions

We ensure innovation, efficiency, and a competitive edge with our expertly developed Llama 2 model-powered solutions. Whether you need chatbots for personalized customer interactions, intelligent content generators, or context-aware recommendation systems, our Llama 2 model-powered applications are meticulously crafted to enhance your organization's capabilities in the dynamic AI landscape.

Seamless integration of Llama 2

We ensure that the Llama 2 model-powered solutions we develop align seamlessly with your existing processes. Our approach involves analyzing your workflows, identifying key integration points, and developing a customized integration strategy. This minimizes disruption while maximizing the benefits of our solutions, facilitating a smooth transition into a more efficient, AI-enhanced operational environment.

Continuous evolution: Upgrades and maintenance

We keep your Llama 2 model-powered application up to date and performance-optimized through comprehensive upgrade and maintenance services. We diligently monitor emerging trends, security updates, and advancements in AI technology, ensuring your application stays competitive and secure in the rapidly evolving tech landscape.

Endnote

This article discussed the intricacies of fine-tuning the Llama 2 7B model using a Colab notebook. We laid a foundational understanding of LLM training and fine-tuning, shedding light on the significance of instruction datasets, and in the practical section, we adapted the Llama 2 model to work with its intrinsic prompt templates and tailored parameters.
When incorporated into platforms like LangChain, these refined models emerge as potent alternatives to offerings like the OpenAI API. It is imperative to recognize that instruction datasets stand paramount in the evolving landscape of language models: the efficacy of your model is intrinsically tied to the quality of its training data, so prioritizing high-caliber datasets is crucial as you embark on this journey. Navigating the complexities of models like Llama 2 may appear challenging, but with diligent application and a clear roadmap, the rewards are substantial. Harnessing the prowess of these advanced LLMs for targeted tasks can enhance applications, ushering in a new era of linguistic computing.

Don't let pre-trained models limit your vision. Our extensive development experience and LLM fine-tuning expertise enable us to build robust custom LLMs tailored to your business's specific needs. Contact our AI experts today and harness the limitless power of LLMs!