We present an approach for deploying large language models locally in an efficient and practical way, targeting domain-specific applications such as question answering over local documents.
1. Local Applications of Large Language
Models based on RAG (Retrieval
Augmented Generation)
——Local Documents Q&A
Luo Weizhi
2. 1. Large language models
2. Key structures of the Transformer model
3. Advantages compared with RNN networks
4. The large language model Llama 2
5. The fine-tuning process on a Q&A dataset
6. LangChain and the concept of chains
7. RAG (Retrieval Augmented Generation)
8. Demonstration of the project
9. Conclusion
4. LLM
A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and other natural language processing tasks (e.g., GPT-4).
LLMs are called "large" because both the models themselves and the text datasets used to train them are very large. A 7B model (7 billion parameters) is among the smallest commonly used LLM sizes, yet it is still trained on a very large dataset. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process.
LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.
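As a minimal sketch of this next-token loop, the snippet below greedily generates tokens with the Hugging Face transformers library; the small GPT-2 model is used here only as a stand-in for a larger LLM.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Retrieval Augmented Generation is"
input_ids = tokenizer(text, return_tensors="pt").input_ids
for _ in range(20):                                   # generate 20 new tokens, one at a time
    with torch.no_grad():
        logits = model(input_ids).logits              # scores for every vocabulary token
    next_id = logits[0, -1].argmax()                  # greedy choice: the most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(input_ids[0]))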
7. Transformer
Current LLMs are built on the Transformer network architecture. It is an encoder-decoder structure: the encoder combines a multi-head attention mechanism with a feed-forward network, and the decoder adds an extra masked-attention part.
Self-Attention: the core of the Transformer. It enables the model to take into account the interactions and dependencies between the elements of a sequence when processing sequence data. Self-attention allows the model to dynamically focus on different parts of the input sequence as it generates each output, which is critical to understanding the context and meaning of the text.
Positional Encoding: Since the Transformer is
entirely based on the attention mechanism and lacks
the ability to deal with sequence order, Positional
Encoding provides information about the position of
individual elements in a sequence by adding
additional information to the input elements.
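One common scheme, used in the original Transformer paper, encodes position pos and embedding dimension index i with sinusoids (Llama 2 itself uses a different variant, rotary position embeddings):

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))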
8. Transformer
Multi-Head Attention: The attention mechanism is
decomposed into multiple "heads", each of which
learns information from a different representation
subspace, which allows the model to capture data
features from multiple perspectives at the same time.
Feed-Forward Networks: In each Transformer
block, the output of the self-attention layer is passed
to a feed-forward network, which is the same for
each position, but is applied independently at
different positions.
9. Transformer
Self-Attention mechanism
The self-attention mechanism allows the model to capture contextual relationships within a sequence
by taking into account other elements in the sequence as each element of the sequence is processed.
The mathematical expression for self-attention is built from the following quantities:
Q, K, and V are the Query, Key, and Value matrices, respectively, obtained by multiplying the embedding vectors of the input sequence with three different weight matrices.
d_k is the dimension of the key vectors; it is used to scale the dot product so that it does not become so large that the softmax function enters its saturation region, which would hurt the backpropagation of gradients.
QK^T denotes the dot product of the queries and keys, used to compute the similarity between positions in the input sequence.
The softmax function converts these similarities into weights.
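Putting these pieces together, the standard scaled dot-product attention formula (with the symbols defined above) is:

Attention(Q, K, V) = softmax( (Q K^T) / sqrt(d_k) ) V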
10. Transformer
Multi-Head Attention
The multi-head attention mechanism divides self-attention into multiple "heads". In layman's terms, it is better to have 8 different people look at the same thing than just 1 person. Each head captures information in a different representation subspace:
W_i^Q, W_i^K, W_i^V, and W^O are the trainable weight matrices.
h is the number of heads.
The information in different representation subspaces can be fused by concatenating the outputs of the different heads and multiplying them by the output weight matrix W^O.
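Written out in the standard form with the symbols above, each head and the combined output are:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O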
11. Transformer
Position-wise Feed-Forward Networks
A position-wise feed-forward network follows each attention layer; it applies the same transformation independently to the representation at each position:
This is a two-layer fully connected feed-forward network, where max(0, x) is the ReLU activation function.
W_1, W_2 and b_1, b_2 are the weights and biases of the network.
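Concretely, the standard position-wise feed-forward network is:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2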
12. Transformer
Output Layer
Ultimately, the Transformer generates predictions for each element of the output sequence using the linear and softmax layers:
Here X is the output of the last decoder layer.
W and b are the weights and biases of the output layer.
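In the standard form, with X, W, and b as defined above, the predicted distribution over the vocabulary is:

P = softmax(X W + b)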
16. Limitations of RNN models
RNNs process language sequentially, in a left-to-right or right-to-left manner. Reading one word at a time forces the RNN to perform many steps to relate words that are far apart, and the more such steps are required, the harder it is for the recurrent network to learn those decisions. In effect, the number of sequential steps grows with the number of words, which makes learning difficult.
• Vanishing and exploding gradient problems, etc.
• Practically impossible to express an entire sentence as a single fixed-length vector.
• Difficult to express complex structures such as sequential information.
[Figure: RNN language model, showing the word embedding layer, the RNN hidden layer, and the probability of each word as output; for all time steps t, the vector corresponding to the last word of y_1, ..., y_t is fed into the model.]
18. Llama 2
Llama 2 is the second generation of large language models introduced by Meta AI, Facebook's AI lab. It comes in four sizes: 7B, 13B, 34B, and 70B.
Given my hardware, we download and fine-tune the original 7B model. Although 7B is the smallest model by volume, it still has about 7 billion tunable weight and bias parameters. These parameters are learned from large amounts of textual data during training, so that the model captures language complexity, contextual relationships, and subtle patterns in language use.
We can download many free base models from https://huggingface.co/
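As a minimal sketch, assuming the Hugging Face transformers library (the Llama 2 repository is gated, so access must first be requested on its model page), downloading a base model looks like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"              # Llama 2-7B base model on the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)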
19. Llama 2
This training process gives the base model a generalized prediction capability.
21. Fine-tuning
LLMs are pretrained on an extensive corpus of text. In the case of Llama 2, we know very little about the
composition of the training set, besides its length of 2 trillion tokens. In comparison, BERT (2018) was “only”
trained on the BookCorpus (800M words) and English Wikipedia (2,500M words). From experience, this is a very
costly and long process with a lot of hardware issues.
When pretraining is complete, auto-regressive models like Llama 2 can predict the next token in a sequence. However, this alone does not make them particularly useful assistants, since they do not follow instructions. This is why we employ instruction tuning to align their answers with what our project expects.
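For illustration only, an instruction-tuning sample pairs a prompt with the desired answer. The example below is made up and uses the widely published Llama 2 chat template; the system prompt, question, and answer would come from the actual dataset:

<s>[INST] <<SYS>>
You are a helpful assistant for local document Q&A.
<</SYS>>

What is Retrieval Augmented Generation? [/INST] Retrieval Augmented Generation retrieves relevant passages from a document store and supplies them to the model as context before it generates an answer. </s>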
22. Fine-tuning
There are two mainstream fine-tuning techniques:
Supervised fine-tuning (SFT): trains the model on a dataset of instructions and answers. It minimizes the difference between the generated answers and the ground-truth answers used as labels by adjusting the weights in the LLM.
Reinforcement Learning with Human Feedback (RLHF): The model learns by interacting with the
environment and receiving feedback. The model is trained to maximize the reward signal (using PPO),
which usually comes from human evaluation of the model output.
In general, RLHF has been shown to capture more complex and nuanced human preferences, but it is also more challenging to implement effectively: the process requires systematic, large-scale human feedback.
Thus, in my project we implement SFT, but this raises the question of why fine-tuning works in the first place. As emphasized in the Orca [1] paper, my understanding is that fine-tuning leverages the knowledge learned during the pre-training process. In other words, if the model has never seen the type of data we are interested in, then fine-tuning will not help.
[1]: Mukherjee, Subhabrata, et al., "Orca: Progressive Learning from Complex Explanation Traces of GPT-4", 2023.
24. Fine-tuning
For our hardware conditions: the available RAM is 16 GB, while the Llama 2-7B weights alone take about 14 GB in FP16 (7B parameters × 2 bytes), so quantization is needed.
First, we have to load our defined dataset. Here, our dataset has already been preprocessed, but typically we can reformat prompts, filter out bad text, merge multiple datasets, and so on.
Then, we configure bitsandbytes 4-bit quantization.
Next, we load the Llama 2 model with 4-bit precision on the GPU, together with the appropriate tokenizer.
Finally, we load the QLoRA configuration and the general training parameters, and pass everything to the SFTTrainer for training.
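A minimal sketch of these steps, assuming the Hugging Face datasets, transformers, peft, bitsandbytes, and trl libraries; the dataset name is a placeholder for our preprocessed Q&A set, and exact argument names vary a little between trl versions:

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

model_name = "meta-llama/Llama-2-7b-hf"

# 1. Load the (already preprocessed) instruction/answer dataset; placeholder name.
dataset = load_dataset("my_local_qa_dataset", split="train")

# 2. Configure bitsandbytes 4-bit (NF4) quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# 3. Load Llama 2 with 4-bit precision on the GPU, plus its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 4. QLoRA adapter configuration and general training parameters.
peft_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM")
training_args = TrainingArguments(output_dir="./results", num_train_epochs=1,
                                  per_device_train_batch_size=4, learning_rate=2e-4)

# 5. Pass everything to the SFTTrainer and train.
trainer = SFTTrainer(model=model, train_dataset=dataset, peft_config=peft_config,
                     tokenizer=tokenizer, dataset_text_field="text",
                     max_seq_length=512, args=training_args)
trainer.train()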
25. Fine-tuning
During this nearly four-hour fine-tuning process, we monitored the run to make sure that the model's fine-tuning behavior was correct.