This document traces the evolution of deep learning models for natural language processing, from RNNs to Transformers. It surveys sequence-to-sequence models and attention mechanisms, and explains how Transformers combine multi-head attention with feedforward networks. It also covers BERT and how it learns language representations by pre-training bidirectional encoders on unlabeled text.
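As a rough illustration of the multi-head attention mentioned above, here is a minimal NumPy sketch: queries, keys, and values are projected, split across heads, attended per head via scaled dot-product attention, then merged and projected back. All function names, shapes, and the random weights are illustrative assumptions for this sketch, not code from the document.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d_k)    # (..., seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ v                                # (..., seq_q, d_head)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Split the model dimension into heads, attend per head, then merge."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(proj):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return proj.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    heads = scaled_dot_product_attention(q, k, v)     # (heads, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o                               # final output projection

# Toy usage: 4 tokens, model width 8, 2 heads. Random weights stand in
# for the learned projection matrices (hypothetical values).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v, w_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=2).shape)  # (4, 8)
```

In a full Transformer block, this attention output would then pass through a position-wise feedforward network, with residual connections and layer normalization around both sublayers.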