This document provides an overview of sequence models and recurrent neural networks (RNNs). It covers RNN variants, including LSTMs and GRUs, as well as attention mechanisms, self-attention, and Transformers, which use self-attention to relate different positions of a sequence and compute a representation of it.
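To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. For simplicity it uses the input embeddings directly as queries, keys, and values (a real Transformer applies learned projection matrices W_Q, W_K, W_V first); the function name and shapes are illustrative assumptions, not from the original document.

```python
import numpy as np

def self_attention(X):
    # X: (seq_len, d) token embeddings. Queries, keys, and values are
    # all X here; a trained model would use learned projections.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # pairwise similarity between positions
    # Softmax over the key dimension: each position gets a weight
    # distribution over all positions in the sequence.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mix of every position's value,
    # which is how self-attention relates different positions.
    return weights @ X

X = np.random.default_rng(0).normal(size=(4, 8))
out = self_attention(X)
print(out.shape)  # one context-aware vector per position
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the sequence's value vectors.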