Representation Capacity
CSE665: Large Language Models
Outline
Section 1: N-Gram and MLP Models
Section 2: RNNs and LSTM Models
Section 3: Transformers
Section 4: Esoteric “Transformer” Architectures
Section 5: Towards Natural Language Understanding
N-Gram and MLP Models
Fill in the blank!
Q: Please pass the salt and pepper to ____
1) me
2) coffee
3) yes
4) refrigerator
In the first place, what are language models?
Language models assign probabilities to sequences of words
Example: find the probability of a word w given some sample text history h
w = "the", h = "its water is so transparent that"
"Chain rule of probability" - the probability of a sequence decomposes into the conditional probabilities of each word given the previous words:
P("its water is so transparent that the") = P("its") × P("water" | "its") × P("is" | "its water") × … × P("the" | "its water is so transparent that")
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
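To make the chain rule concrete, here is a minimal Python sketch of the decomposition; cond_prob is a hypothetical stand-in for any model that supplies P(word | history), not something from the sources above.

```python
# Minimal sketch of the chain rule of probability for a word sequence.
# cond_prob is a hypothetical placeholder: any model that returns
# P(word | history) could be plugged in here.

def cond_prob(word: str, history: tuple[str, ...]) -> float:
    """Toy stand-in for a conditional word probability."""
    return 0.1  # illustrative constant value

def sequence_probability(words: list[str]) -> float:
    """P(w1..wn) = product over i of P(w_i | w_1..w_{i-1})."""
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, tuple(words[:i]))
    return prob

print(sequence_probability("its water is so transparent that the".split()))
```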
bi-gram
Approximates the conditional probability of a word given its entire history
by using only the conditional probability given the preceding word
P("its water is so transparent that the") ≈ P("its") × P("water" | "its") × P("is" | "water") × … × P("the" | "that")
P("the" | "its water is so transparent that") ≈ P("the" | "that")
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
bi-gram
Approximates the conditional probability of a word given its entire history
by using only the conditional probability given the preceding word
This is a Markov assumption: we assume we can predict the probability of a future unit without
looking far into the past
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
n-gram
Approximates the conditional probability of a word given its entire history by using only the
conditional probability given the preceding (n-1) words
To estimate these probabilities, we use maximum likelihood estimation:
1) Getting counts of the n-grams from a given corpus
2) Normalizing the counts so they lie between 0 and 1
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
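As a sketch of the two MLE steps above (counting, then normalizing), here is a minimal bigram estimator in Python; the toy corpus and helper names are purely illustrative. Note how an unseen bigram gets probability zero, which leads into the sparse-data issue on the next slide.

```python
from collections import Counter

# Minimal sketch of maximum likelihood estimation for a bigram model:
# 1) count bigrams and unigrams in a corpus, 2) normalize the counts so that
# P(w2 | w1) = count(w1, w2) / count(w1). The corpus below is a toy example.
corpus = "its water is so transparent that the water is clear".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1: str, w2: str) -> float:
    """MLE estimate of P(w2 | w1); zero if the bigram was never seen."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("water", "is"))    # seen bigram -> nonzero probability
print(bigram_prob("water", "cold"))  # unseen bigram -> 0.0 (sparse data problem)
```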
Issues with "count-based" approximations
Language is a creative exercise: many different permutations of
words can carry the same meaning.
The probability is zero if an n-gram is absent from the corpus and this is not
dealt with, e.g. by smoothing (the sparse data problem).
Large amount of memory required - for a language with a
vocabulary of V words and an n-gram language model, we would
need to store on the order of V^n values (see the quick estimate below)
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
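To get a feel for the V^n figure, here is a quick back-of-the-envelope estimate; the 50,000-word vocabulary is an illustrative assumption, not a number from the sources.

```python
# Rough memory estimate for a count-based n-gram model.
# A 50,000-word vocabulary is an illustrative assumption.
V = 50_000  # vocabulary size
for n in (2, 3, 4):
    print(f"{n}-gram table size: V**n = {V**n:.2e} entries")
# Even the trigram table (~1.25e+14 entries) is far too large to store densely.
```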
A neural probabilistic language model
1) Takes words of similar meaning into consideration
2) Should be able to handle "longer" contexts
without incurring a large memory cost
Ex: "The cat is walking in the bedroom" should be similar to "A dog
was running in a room"
TLDR: Instead of storing counts for every permutation, let's learn an embedding of each token in the
vocab given the context.
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
Main ideas:
1) Associating each word in the vocabulary with a word feature
vector
2) Expressing the joint probability function of word sequences in
terms of the word feature vectors
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
Ex: "The cat is walking in the bedroom" should be similar to "A dog
was running in a room"
"the" should be similar to "a"
"bedroom" should be similar to "room"
"running" should be similar to "walking"
Hence, we should be able to generalize from “The cat is walking in
the bedroom” to “A dog was running in a room”, as similar words are
expected to have similar feature vectors
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Neural network architecture proposed by Bengio et al., 2003
Embedding Layer
Hidden Layer
Probability Layer
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
Embedding Layer
A mapping from a word to a vector that describes it
Represented by a |V| × d matrix, where |V| is the size of the
vocabulary and d is the size of each feature vector (30 - 100 in Bengio et
al., 2003)
Embeddings here are trained via the task at hand (predicting the
next word given a context)
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
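A minimal NumPy sketch of the embedding layer as a lookup into a |V| × d matrix; the toy vocabulary, d = 50, and the random initialization are illustrative assumptions (in practice the matrix is learned during training).

```python
import numpy as np

# Minimal sketch of an embedding layer: a |V| x d matrix of feature
# vectors, indexed by word id. Tiny vocabulary and d = 50 are assumptions.
vocab = ["the", "a", "cat", "dog", "is", "was", "walking", "running"]
word_to_id = {w: i for i, w in enumerate(vocab)}

d = 50
rng = np.random.default_rng(0)
C = rng.normal(scale=0.1, size=(len(vocab), d))  # embedding matrix (learned in practice)

def embed(words: list[str]) -> np.ndarray:
    """Look up and concatenate the feature vectors of a context window."""
    ids = [word_to_id[w] for w in words]
    return C[ids].reshape(-1)  # shape: (len(words) * d,)

x = embed(["the", "cat", "is"])  # context of 3 words -> 150-dim input
print(x.shape)
```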
Hidden Layer
It transforms the input sequence of feature vectors and captures
contextual information
In the paper, a multi-layer perceptron was used, with hyperbolic
tangent activations when hidden layers are used
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Probability Layer
Produces a probability distribution over the words in the vocabulary
through the use of the softmax function
Output is a |V|-dimensional vector, where the i-th entry represents the
probability that the next word is the i-th word in the vocabulary
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
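Putting the three layers together, here is a minimal, self-contained NumPy sketch of the forward pass (embedding lookup → tanh hidden layer → softmax over the vocabulary); the toy vocabulary, dimensions, and random weights are illustrative assumptions, not the exact configuration of Bengio et al.

```python
import numpy as np

# Minimal forward pass of a Bengio-style neural LM:
# embedding lookup -> tanh hidden layer -> softmax over the vocabulary.
# The tiny vocabulary, d = 50 and h = 100 are illustrative assumptions.
vocab = ["the", "a", "cat", "dog", "is", "was", "walking", "running"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d, h, context_size = len(vocab), 50, 100, 3

rng = np.random.default_rng(0)
C = rng.normal(scale=0.1, size=(V, d))                        # embedding matrix
W_hidden = rng.normal(scale=0.1, size=(context_size * d, h))  # hidden layer weights
W_out = rng.normal(scale=0.1, size=(h, V))                    # output (probability) layer weights

def next_word_distribution(context: list[str]) -> np.ndarray:
    ids = [word_to_id[w] for w in context]
    x = C[ids].reshape(-1)                  # embedding layer: concatenated context vectors
    hidden = np.tanh(x @ W_hidden)          # hidden layer: tanh MLP
    logits = hidden @ W_out                 # probability layer: one score per vocab word
    exp = np.exp(logits - logits.max())     # softmax, numerically stable
    return exp / exp.sum()

p = next_word_distribution(["the", "cat", "is"])
print(vocab[int(p.argmax())], round(float(p.sum()), 3))  # argmax word, probabilities sum to 1
```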
Effectiveness of the model
Test perplexity improvement of 24% compared to n-gram models
Able to take advantage of longer contexts (increasing the context from 2-gram
to 4-gram benefitted the MLP approach, but not the n-gram approach)
Including hidden units with hyperbolic tangent activations
further improved the perplexity of the model
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Limitations of N-gram models
Not able to grasp relations longer than the window size (learning a 6-word
relation with a 5-gram neural network is not possible)
Cannot model "memory" in the network (an n-gram model only has
the context of the preceding (n-1) words)
The man went to the bank to deposit a check.
The children played by the river bank.
The same word "bank" needs different interpretations depending on its context.
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Need for Sequential modelling
In n-gram and MLP models, each word has only one embedding; the representation does not take
context (memory) into account.
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Outline
Section 1: N-Gram and MLP Models
Section 2: RNNs and LSTM Models
Section 3: Transformers
Section 4: Esoteric “Transformer” Architectures
Section 5: Towards Natural Language Understanding
RNNs and LSTMs (SKIP)
Transformer
Agenda: Transformer
- What is a transformer?
- Encoder and decoder.
- Self-attention.
- Probing
- Attention heads;
- Feedforward layers.
What is a transformer?
- Motivation
Source: Attention is all you need NIPS ‘17.
What is a transformer?
- Encoder (left) and decoder (right)
- Q2: What is the connection between the encoder and the decoder?
- Q3: Which transformer components do the following models use?
- BERT (masked language modeling) uses which component?
- T5 (seq2seq)?
- GPT (text generation)?
Source: Attention is all you need NIPS ‘17.
Encoder
Components in the encoder:
- Multi-head attention
- Feed-forward (FF) layers
- Add & Norm (residual connection + layer normalization)
Source: Attention is all you need NIPS ‘17.
Primarily for optimization,
skip them in class.
Recall attention
Source: NUS CS4248 Natural Language Processing
Self-attention in Transformer
- Q: Query
- K: Key
- V: Value
- Motivation?
- Similar to dot-product attention.
- Scaling factor (1/√d_k) for stable gradients.
Source: Attention is all you need NIPS ‘17.
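For reference, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)·V, as defined in the paper; the token count and dimensions below are illustrative.

```python
import numpy as np

# Minimal scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V,
# following "Attention Is All You Need". Shapes are illustrative.
def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))   # 4 tokens, d_k = 64
K = rng.normal(size=(4, 64))
V = rng.normal(size=(4, 64))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```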
Walk through self-attention -- Step 1
- We first transform the input
X into Q, K and V.
Source: https://jalammar.github.io/illustrated-transformer/
Walk through self-attention -- Step 2
- Then perform the self-attention
between Q, K and V.
Source: https://jalammar.github.io/illustrated-transformer/
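A minimal sketch of Steps 1 and 2: project the input X into Q, K and V with learned matrices, then attend between them; the sizes and random weights are illustrative assumptions.

```python
import numpy as np

# Steps 1-2: project the input token vectors X into Q, K, V with learned
# matrices, then apply scaled dot-product attention between them.
# Sizes (3 tokens, d_model = 8, d_k = 4) and random weights are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))           # 3 input tokens, d_model = 8

W_Q = rng.normal(size=(8, 4))         # learned projection matrices
W_K = rng.normal(size=(8, 4))
W_V = rng.normal(size=(8, 4))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # Step 1: one query/key/value per token

scores = Q @ K.T / np.sqrt(K.shape[-1])                 # Step 2: scaled dot products
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)          # softmax over keys
Z = weights @ V                                         # output: one vector per token
print(Z.shape)                                          # (3, 4)
```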
Walk through self-attention -- Step 2 Example
Source: https://jalammar.github.io/illustrated-transformer/
Walk through self-attention -- Step 3 -- Multi-head
Source: https://jalammar.github.io/illustrated-transformer/
Why do we need multi-head attention anyway?
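One common answer: each head has its own Q/K/V projections, so different heads can attend to different relations in the same input. Below is a minimal two-head sketch; the toy sizes and random weights are illustrative assumptions.

```python
import numpy as np

# Minimal multi-head attention sketch: each head has its own Q/K/V
# projections, the head outputs are concatenated and mixed by W_O.
# Sizes (3 tokens, d_model = 8, 2 heads of d_k = 4) are illustrative.
rng = np.random.default_rng(0)
d_model, n_heads, d_k = 8, 2, 4
X = rng.normal(size=(3, d_model))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # each head: its own attention pattern

W_O = rng.normal(size=(n_heads * d_k, d_model))
out = np.concatenate(heads, axis=-1) @ W_O              # concat heads, project back to d_model
print(out.shape)                                        # (3, 8)
```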
Encoder as “memory” for decoder
Encoder: MultiHead(Q, K, V)
- Q: Are Q, K, V the same?
Decoder: MultiHead(Q, K, V)
Encoder as “memory” for decoder
Encoder: MultiHead(Q, K, V)
- Q: Are Q, K, V the same?
- Yes!
Decoder: MultiHead(Q, K, V)
- Q: Are Q, K, V the same?
Encoder as “memory” for decoder
Source: NUS CS4248 Natural Language Processing
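A minimal sketch of the encoder-as-"memory" idea: in the decoder's cross-attention, Q comes from the decoder states while K and V come from the encoder output, so they are not all derived from the same sequence (the answer to the question on the previous slide). Sizes and weights below are illustrative assumptions.

```python
import numpy as np

# Cross-attention sketch: queries come from the decoder, keys and values
# come from the encoder output, so the encoder acts as a "memory" that the
# decoder can read from. Sizes and random weights are illustrative.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
enc_out = rng.normal(size=(5, d_model))     # 5 source tokens from the encoder
dec_states = rng.normal(size=(3, d_model))  # 3 target tokens generated so far

W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q = dec_states @ W_Q                   # queries: from the decoder
K, V = enc_out @ W_K, enc_out @ W_V    # keys/values: from the encoder ("memory")

scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print((weights @ V).shape)             # (3, 4): one context vector per decoder position
```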
Masking for the decoder
Source: NUS CS4248 Natural Language Processing
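A minimal sketch of decoder masking: future positions are blocked by setting their attention scores to -inf before the softmax, so each position can only attend to itself and earlier positions. Toy sizes are illustrative.

```python
import numpy as np

# Causal (look-ahead) masking sketch for the decoder: each position may
# only attend to itself and earlier positions. Toy sizes are illustrative.
rng = np.random.default_rng(0)
T, d_k = 4, 4
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))

scores = Q @ K.T / np.sqrt(d_k)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal = future positions
scores = np.where(mask, -np.inf, scores)           # forbid attending to the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # upper triangle of the attention matrix is all zeros
```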
Reading 1: Probing attention heads
Source: Revealing the Dark Secrets of BERT