From Seq2Seq to BERT
Huali Z
ABSTRACT
At the end of 2018, BERT[5] caused a stir in machine learning with a 7.7 percentage point improvement on the GLUE (General Language Understanding Evaluation) score and a 5.1 point improvement on SQuAD (Stanford Question Answering Dataset) v2.0. Increasing numbers of pre-training models have been published based on BERT, and record scores on SQuAD, GLUE and MultiNLI (Multi-Genre Natural Language Inference) have been refreshed every year. The nitty-gritty of BERT is the Transformer[16] and bi-directional training; for the Transformer, attention is all you need [16]. Only 6 years have passed since the Seq2Seq[15] model was published in 2014. How did the models evolve to consume less computing power, to enable parallel computing and to achieve better scores? This paper illustrates a short history of the evolution of NLP models, step by step, from the Seq2Seq model to BERT and from the intermediate fixed size vector to the Transformer. It also connects the dots between terms like Seq2Seq, fixed size vector, attention, soft alignment, Transformer, pre-training, fine-tuning, and BERT.
1 INTRODUCTION
Starting with Seq2Seq, which solved the problem that deep neural networks can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality[15], this paper presents a short evolution history of NLP models from 2014 to 2020, when variants of BERT models were published.
The Seq2Seq model proposed a structure with a fixed size vector between the input and output layers. This intermediate vector addresses the dimensionality problem of deep neural networks. When it comes to long sentences, however, the fixed size vector can be a bottleneck for performance.
The attention model introduced an encoder-decoder structure with soft alignment. After the attention model, different types of attention mechanisms were developed. Soft attention works well for capturing context information, but it is expensive. Hard attention is less expensive but requires more complicated techniques. Global attention takes in all hidden states, whereas local attention only takes in a subset of the hidden states. Self attention uses the positions of a sequence to produce the encoding of a word in the same sequence.
Attention can also be categorized by how the attention score is calculated. Dot-product attention and concat attention are the most used methods. Vaswani's research has shown that dot-product attention is much faster and more space efficient.
RNNs in attention models can be an obstacle for parallelism and resource efficiency. The Transformer solves this problem by relying entirely on self attention, without RNNs. Along with self attention, the Transformer combines techniques like scaled dot-product attention, multi-head attention, positional encoding and masking to achieve the best result.
Based on the Transformer, BERT was created for pre-training. The ground-breaking feature of BERT is the bidirectional training of the Transformer. It allows BERT to understand context better and enables information flow between layers. Variants of BERT have been published since BERT came under the spotlight. Models like SpanBERT and SesameBERT improve the attention span to get a better understanding of context. ALBERT, DistilBERT, TinyBERT and MobileBERT improve cost efficiency while keeping comparable performance. SciBERT and BioBERT give pre-trained models for domain specific NLP tasks.
2 SEQ2SEQ
Deep neural networks have shown their power in parallel computation and back-propagation. At the same time, they have a limitation on the dimensionality of inputs and outputs: the input and output have to be encoded with fixed size vectors. In real life, for many problems the sizes of the input and output vectors are not known ahead of time, for example sequence problems such as speech recognition.
The Seq2Seq model[15] was created for language modeling. It is normally used for translation, summary extraction and question answering bots. The model contains two parts: an encoder and a decoder. The encoder maps the input sequence into a fixed size vector and the decoder outputs a sequence from that vector. Both the encoder and decoder are multi-layered LSTMs. The goal of the LSTM is to estimate the conditional probability $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$ [15], where the input sequence length $T$ may differ from the output sequence length $T'$.
The mathematical representation of this model is:
$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1}) \quad (1)$$
In the equation, $v$ is a fixed size vector produced by the last hidden state of the input LSTM. This fixed size vector is used to initialize the output LSTM. Each $p(y_t \mid v, y_1, \ldots, y_{t-1})$ is represented with a softmax over all the words in the vocabulary [15].
There are three features which are not included in the mathematical representation.
(1) Two different LSTMs are used for the input sequence and the output sequence respectively. This increases model capacity at a small computational cost and makes it natural to train the model on different language pairs.
(2) A deep LSTM with 4 layers is used, because deep LSTMs work better than shallow LSTMs in this case.
(3) Reversing the words of the input sequence introduces short term dependencies, which improves the model's performance on long sentences.
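As an illustration only, the sketch below shows a minimal encoder-decoder of the kind described above, using PyTorch LSTMs; the layer sizes and variable names are assumptions for the example, not the configuration used in [15].

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch: the encoder's final LSTM state is the
    fixed size vector v that initializes the decoder (hyperparameters are illustrative)."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode: keep only the final (h, c) state, i.e. the fixed size vector v.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode: condition every output step on v and the previous target tokens.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)          # logits over the target vocabulary

# Toy usage: batch of 2 source sentences (length 7) and target prefixes (length 5).
model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1200, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1200])
```

A softmax over these logits corresponds to the per-step term in Equation 1; reversing src_ids before encoding would be the trick mentioned in item (3).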
3 ATTENTION
The fixed size vector in the Seq2Seq model enables modeling without knowing the input and output sequence sizes. However, when it comes to long sentences, it becomes a bottleneck for improving performance: because all the information is compressed into the fixed size vector, the beginning part of the information is forgotten. Bahdanau et al.[2] proposed a new mechanism which abandons the fixed size vector and introduces a new encoder-decoder architecture.
Figure 1: Attention architecture [2]
As Figure 1 illustrates, the encoder converts the input sequence $x$ into annotations $h$ using a bi-directional RNN. These annotations are combined through a weighted sum into a context vector $c$. Using the context vector $c$, along with the previous hidden state $s_{i-1}$ and the previous output $y_{i-1}$, the decoder computes the output $y_i$. Using a bi-directional RNN enables the encoder to calculate hidden states in both directions, so for each word $x_j$ the annotation is a concatenation $h_j = [\overrightarrow{h_j}^{T} ; \overleftarrow{h_j}^{T}]$.
The equation of this model can be written as:
$$p(y_i \mid y_1, \ldots, y_{i-1}, X) = g(y_{i-1}, s_i, c_i) \quad (2)$$
where:
$$s_i = f(s_{i-1}, y_{i-1}, c_i) \quad (3)$$
$$c_i = \sum_{j=1}^{T_x} a_{ij} h_j \quad (4)$$
$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \quad (5)$$
$$e_{ij} = a(s_{i-1}, h_j) \quad (6)$$
$s_i$ in Equation 3 is the hidden state of the RNN at time $i$. $c_i$ in Equation 4 is the context vector, a weighted sum of the annotations $h_j$. The annotation sequence $(h_1, \ldots, h_{T_x})$ is mapped from the input sequence, and each annotation $h_j$ contains information about all the inputs with a strong focus on the words around the $j$-th word. The weight $a_{ij}$ in Equation 5 is calculated by an alignment model in Equation 6. The alignment model scores how well the inputs around position $j$ and the output at position $i$ match; it is trained and optimized jointly with the whole translation model[2].
From the equations, we can see that the next output $y_i$ and hidden state $s_i$ depend on the previous hidden state $s_{i-1}$ and the annotations $h_j$. This design implements the attention mechanism in the decoder. With attention, the fixed size vector is no longer needed in the model. Instead, a soft alignment between the input and output is learnt, which improves the model's performance on long sentences significantly.
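To make Equations 2 to 6 concrete, here is a small NumPy sketch of one decoder step of this additive (concat) attention; the dimensions and parameter names are illustrative assumptions, not values from [2].

```python
import numpy as np

rng = np.random.default_rng(0)
T_x, enc_dim, dec_dim, attn_dim = 6, 8, 8, 10   # illustrative sizes

h = rng.normal(size=(T_x, enc_dim))   # encoder annotations h_1 .. h_Tx
s_prev = rng.normal(size=dec_dim)     # previous decoder hidden state s_{i-1}

# Alignment model a(s_{i-1}, h_j): a small feed-forward net (Equation 6).
W_s = rng.normal(size=(attn_dim, dec_dim))
W_h = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=attn_dim)

e = np.array([v_a @ np.tanh(W_s @ s_prev + W_h @ h_j) for h_j in h])  # e_ij
a = np.exp(e) / np.exp(e).sum()                                       # Equation 5
c = (a[:, None] * h).sum(axis=0)                                      # Equation 4

print(a.round(3), c.shape)  # attention weights over 6 source positions, context vector (8,)
```

The context vector c would then feed the decoder state update (Equation 3) and output distribution (Equation 2).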
3.1 Attention Family
Depending on how the alignment score is calculated, attention mechanisms can be categorized into dot-product[11], scaled dot-product[16], concat[2], general and others. In broader categories, there are global attention[11], local attention[11], hard attention[11], soft attention[2] and self attention[4].
3.1.1 hard vs soft. In a soft attention model, the alignment vector is learnt by an alignment model (see Equation 6). The alignment model directly computes a soft alignment, which allows the gradient of the cost function to be back-propagated through it. This gradient can be used to train the alignment model and the whole translation model jointly[2]. For example, in an English-to-French translation task, soft alignment allows the model to translate the word 'the' to 'le' or 'la' dynamically based on the context. The drawback of this model is that when the input sequence is long, the computation is expensive.
In hard attention, the alignment vector $a_t$ is learnt from a selected part of the input sequence. It is less expensive than the soft attention model, but it is non-differentiable and requires more complicated techniques, such as variance reduction or reinforcement learning, to train[11].
3.1.2 global vs local. The idea of the global attention model is to consider all the hidden states of the encoder when deriving the context vector $c_t$ [11]. As Figure 2 shows, all the hidden states are used to calculate the alignment vector $a_t$ at time $t$. The global context vector $c_t$ is then generated as a weighted average over $a_t$.
Using all the hidden states can be expensive and can potentially render it impractical to translate longer sequences, e.g. paragraphs or documents[11]. Local attention chooses to pay attention to only a subset of the hidden states to get the weight vector $a_t$. Figure 3 illustrates that the alignment vector $a_t$ is derived from a part of the input hidden states, namely those within a window $[p_t - D, p_t + D]$, where $p_t$ is the aligned position and $D$ is a selected value. The context vector $c_t$ is a weighted average over $a_t$. Unlike in the global attention model, $a_t$ is now a fixed size vector. To avoid the non-differentiable alignment problem of the hard attention model, the local attention model predicts the aligned position $p_t$ using Equation 7, where the model parameters $W_p$ and $v_p$ are learnt for predicting $p_t$.
$$p_t = S \times \mathrm{sigmoid}(v_p^{T} \tanh(W_p h_t)) \quad (7)$$
Here $S$ is the source sentence length [11].
3.1.3 self attention. Self attention is also called intra-attention[4]. It is a mechanism that uses other positions of a sequence to produce the encoding of a word in the same sequence. Figure 4 shows how much attention is paid in a machine reading task using self attention: the model processes text incrementally while learning which past tokens in the memory, and to what extent, relate to the current token being processed[4].
3.1.4 alignment score calculation methods. There are various ways of calculating the alignment score. The score here means how well a source sequence hidden state and a target sequence hidden state are aligned, and Equation 5 is a good example of an alignment score calculation method.
Bahdanau's research employed a feed-forward network, the function $a$ in Equation 6, to derive the alignment score. This way of getting the alignment score is referred to as concat in Luong's research[11] and as additive attention in Vaswani's paper[16]. Besides this method, Luong's research also explored 'dot product' $\mathrm{score}(h_t, h_s) = h_t^{T} \bar{h}_s$ [11], 'general' $\mathrm{score}(h_t, h_s) = h_t^{T} W_a \bar{h}_s$ [11] and a 'location based' function $a_t = \mathrm{softmax}(W_a h_t)$ [11]. The research shows that the concat method is no better than the others. Dot-product attention and concat attention are the most used methods for getting alignment scores. Dot-product attention is much faster and more space efficient in practice, since it can be implemented using highly optimized matrix multiplication code[16]. Vaswani's research introduced a new method called 'scaled dot product', $\mathrm{score}(h_t, h_s) = \frac{h_t^{T} \bar{h}_s}{\sqrt{n}}$, which is similar to dot product but divides by $\sqrt{n}$, the square root of the source hidden state dimension. Adding this scaling factor helps the model deal with large score values when the dimension $n$ is large.
Figure 2: Global attention [11]
Figure 3: Local attention [11]
Figure 4: Self attention in a machine reading task [4]
Figure 5: Scaled dot-product attention [16]
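The following NumPy sketch compares these score functions on a single pair of hidden states; the shapes and parameter names ($h_t$, $\bar{h}_s$, $W_a$, $n$) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8                                    # hidden state dimension (illustrative)
h_t = rng.normal(size=n)                 # target hidden state h_t
h_s = rng.normal(size=n)                 # source hidden state h_s (bar)
W_a = rng.normal(size=(n, n))            # parameters for 'general'
W_c = rng.normal(size=(n, 2 * n))        # parameters for 'concat' / additive
v_a = rng.normal(size=n)

score_dot        = h_t @ h_s                                       # dot product [11]
score_general    = h_t @ W_a @ h_s                                 # general [11]
score_concat     = v_a @ np.tanh(W_c @ np.concatenate([h_t, h_s])) # concat / additive [2]
score_scaled_dot = (h_t @ h_s) / np.sqrt(n)                        # scaled dot product [16]

print(score_dot, score_general, score_concat, score_scaled_dot)
```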
4 TRANSFORMER
Using RNNs for sequence modeling was no doubt the state-of-the-art approach before the Transformer architecture. RNNs can capture the coherence of a sequence with hidden states. However, this inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples[16]. Vaswani's work proposed the Transformer, a model that relies entirely on the self attention mechanism, without RNNs.
4.1 Transformer components
4.1.1 Transformer's scaled dot-product self attention. The attention mechanism first creates three vectors, Q (query), K (key) and V (value), for each word embedding. These vectors are created by multiplying the input embedding with three matrices $W_Q$, $W_K$ and $W_V$. It then calculates a score for each word in the input sentence, divides the scores by the square root of the K dimension, applies a softmax to the scaled scores, and finally multiplies the V vectors by the softmaxed scores. The whole process is shown in Equation 8[16] and Figure 5[16]. The scaling helps to get more stable gradients, and the softmax function makes sure the scores at each input sequence position add up to 1.
π΄π‘‘π‘‘π‘’π‘›π‘‘π‘–π‘œπ‘›(𝑄, 𝐾,𝑉 ) = π‘†π‘œπ‘“ π‘‘π‘šπ‘Žπ‘₯(
𝑄𝐾𝑇
p
π‘‘π‘˜
)𝑉 (8)
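A minimal NumPy sketch of Equation 8, assuming toy shapes (a sequence of 4 tokens with $d_k = 8$); the projection matrices and sizes are illustrative, not the configuration of [16].

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation 8: softmax(Q K^T / sqrt(d_k)) V, computed row-wise."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq, seq) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each query's row
    return weights @ V, weights

rng = np.random.default_rng(3)
seq_len, d_model, d_k = 4, 16, 8                      # illustrative sizes
X = rng.normal(size=(seq_len, d_model))               # input embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, attn = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, attn.shape)                          # (4, 8) (4, 4)
```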
4.1.2 Multi-head attention.
Figure 6: Multi-head attention [16]
As Figure 6 shows, multi-head attention linearly projects the keys, values and queries h times. The attention layer is applied to each projected version of K, V and Q, and each attention layer produces its output values in parallel. These outputs are concatenated and linearly projected to get the final value. With multi-head attention, the model increases its ability to focus on different parts of the input sequence by taking in information from different representations jointly. The multi-head attention mechanism can be written as Equation 9[16].
π‘€π‘’π‘™π‘‘π‘–π»π‘’π‘Žπ‘‘(𝑄, 𝐾,𝑉 ) = πΆπ‘œπ‘›π‘π‘Žπ‘‘(β„Žπ‘’π‘Žπ‘‘1, ...,β„Žπ‘’π‘Žπ‘‘β„Ž)π‘Šπ‘œ
(9)
where
β„Žπ‘’π‘Žπ‘‘π‘– = π΄π‘‘π‘‘π‘’π‘›π‘‘π‘–π‘œπ‘›(π‘„π‘Š
𝑄
𝑖 , πΎπ‘Š 𝐾
𝑖 ,π‘‰π‘Šπ‘‰
𝑖 ) (10)
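Below is a NumPy sketch of Equations 9 and 10 with 2 heads; the per-head projections and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):                                # Equation 8 per head
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(4)
seq_len, d_model, n_heads = 4, 16, 2
d_head = d_model // n_heads                            # 8 dimensions per head (illustrative)
X = rng.normal(size=(seq_len, d_model))

# Per-head projection matrices W_i^Q, W_i^K, W_i^V and the output projection W^O.
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(n_heads * d_head, d_model))

heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(n_heads)]
multi_head = np.concatenate(heads, axis=-1) @ W_O      # Equations 9-10
print(multi_head.shape)                                # (4, 16)
```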
4.1.3 position encoding. Removing the RNNs from the model discards the hidden states which implicitly carry position information. To bring the position information back, the Transformer injects a positional encoding into the input embeddings of both the encoder and the decoder. The encoding carries absolute or relative position information, which then flows into the projected K, V and Q vectors.
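As a sketch, the sinusoidal positional encoding used in [16] can be computed as below and added to the input embeddings; the sequence length and model dimension are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
# embeddings_with_position = token_embeddings + pe  (added element-wise)
print(pe.shape)   # (10, 16)
```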
4.1.4 masking. The self attention layer in the decoder is different from the one in the encoder: masking is applied in the decoder self attention layer to ensure that each prediction only attends to words in earlier positions.
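A minimal sketch of such a causal mask: positions after the current one receive a large negative score before the softmax, so their attention weight becomes effectively zero; the shapes are illustrative.

```python
import numpy as np

seq_len = 4
# Upper-triangular mask: True where a query would look at a future position.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

rng = np.random.default_rng(5)
scores = rng.normal(size=(seq_len, seq_len))      # raw Q K^T / sqrt(d_k) scores
scores = np.where(future, -1e9, scores)           # block attention to the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # row t has non-zero weights only for positions <= t
```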
4.2 Transformer architecture
The architecture of the Transformer is illustrated in Figure 7. Like the Seq2Seq model and other attention models, the Transformer has an encoder-decoder structure. Both the encoder and decoder are composed of 6 identical layers; 6 is an empirically chosen number. Each encoder layer contains two sub-layers: a multi-head self attention layer and a feed forward network layer. After positional encoding, the input sequence flows into the multi-head attention layer so the model can encode a word together with its surrounding words. The output of the multi-head attention layer is fed into a feed forward neural network which is applied to each position. The decoder has a similar structure to the encoder, but with one more multi-head attention layer that takes in the output from the encoder. The output of the decoder passes through a linear layer and a softmax layer so that the output vector can be mapped to a word.
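As a sketch of this layer composition, the snippet below builds one encoder layer from PyTorch's built-in multi-head attention and a position-wise feed forward network; the dimensions are illustrative, and the residual connections and layer normalization used in [16] are included for completeness.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: multi-head self attention + position-wise FFN.
    Residual connections and layer norm follow [16]; the sizes here are illustrative."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # queries, keys and values all come from x
        x = self.norm1(x + attn_out)            # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))         # position-wise feed forward sub-layer
        return x

# A 6-layer encoder stack, as described above.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
out = encoder(torch.randn(2, 10, 512))          # batch of 2, sequence length 10
print(out.shape)                                # torch.Size([2, 10, 512])
```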
Figure 7: Transformer[16]
Figure 8: BERT framework [5]
5 BERT
BERT, which stands for Bidirectional Encoder Representations from Transformers[5], is a pre-trained Transformer encoder which improves fine-tuning through bidirectional training. When applying a pre-trained model to a specific downstream task, there are two main strategies: feature-based and fine-tuning. The feature-based strategy uses the pre-trained representations as an extra feature; fine-tuning simply fine-tunes the task specific parameters. The pre-training models before BERT were unidirectional, which restricts the performance of context understanding. BERT introduced a state-of-the-art MLM (masked language model) objective along with "next sentence prediction" to enable bidirectional representations. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context[5].
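A minimal sketch of this masking step, assuming a toy whitespace tokenizer and a 15% masking rate as reported in [5]; the additional corruption choices BERT makes beyond simple [MASK] replacement are omitted.

```python
import random

MASK, MASK_RATE = "[MASK]", 0.15   # 15% of tokens are selected, as in [5]

def mask_tokens(tokens, rng=random.Random(0)):
    """Return masked tokens and the labels the model must predict (None = not masked)."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_RATE:
            masked.append(MASK)      # hide the token from the model
            labels.append(tok)       # training target: recover the original token
        else:
            masked.append(tok)
            labels.append(None)      # no loss is computed at unmasked positions
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens))
```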
5.1 Optimized BERT
5.1.1 Improved training procedure. After evaluating the effect of BERT's hyperparameters and training data, Liu et al. found that BERT is significantly undertrained. RoBERTa (Robustly optimized BERT approach)[10] was created to provide a better recipe for training BERT models. This approach includes (1) training the model longer, with bigger batches, over more data; (2) removing the next sentence prediction objective; (3) training on longer sequences; and (4) dynamically changing the masking pattern applied to the training data[10]. With these changes, RoBERTa achieves improved performance compared with $BERT_{LARGE}$.
5.1.2 Improved attention span. SpanBERT[7] is an extended BERT with randomly masked spans and an SBO (span boundary objective). Compared with the masked tokens in vanilla BERT, masked spans provide a better context representation. The SBO trains the model to predict the entire masked span from the observed tokens at its boundary, without relying on the individual token representations within it[7].
SesameBERT[13] enables information flow across BERT's stacked layers through Squeeze and Excitation. It also enriches local information by capturing neighboring context via Gaussian blurring[13].
5.1.3 Improved cost efficiency. As pre-trained model sizes grow for better fine-tuning performance, the memory limitation of available hardware becomes an obstacle to scaling pre-trained models further. Existing model parallelization and clever memory management methods can mitigate the memory shortage but not the communication overhead. ALBERT (A Lite BERT)[8] was born for these problems. ALBERT provides a BERT model with fewer parameters, so it consumes less memory and runs at a faster speed. Two parameter reduction strategies are used in ALBERT: the first is a factorized embedding parameterization, and the second is cross-layer parameter sharing[8].
To alleviate the resource limitation, DistilBERT[12] introduces a triple loss combining language modeling, distillation and cosine-distance losses[12]. It reduces the BERT model size by 40% and speeds up the model by 60% while keeping 97% of its language understanding ability. Similar modified BERT models are MobileBERT[14] and TinyBERT[6]. MobileBERT's goal is to build a lightweight pre-trained model for different downstream tasks. TinyBERT is 7.5 times smaller and 9.4 times faster, using a novel Transformer distillation method and a two-stage learning framework.
5.1.4 Domain oriented improvement. In specific domains, the lack of high quality, large scale data limits BERT's performance. Domain-specific pre-trained BERT models have been published to provide language representation models for such domains. For example, SciBERT[3] and BioBERT[9] are trained on large multi-domain scientific publication corpora and biomedical corpora respectively.
6 CONCLUSION
A rapidly growing body of research has been built on variants of BERT in just two years. While reading these papers, understanding concepts like attention helps to know what a model is based on. This paper introduces a short history of NLP models: Seq2Seq, Attention, Transformer and BERT. All these models aim at getting better performance while being less expensive in time and memory.
REFERENCES
[1] Jay Alammar. 2019. The illustrated Transformer. http://jalammar.github.io/
illustrated-transformer/
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. NEURAL MA-
CHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE.
(2015). arXiv:1409.0473v7
[3] Iz Beltagy, Kyle Lo, and Arman Cohan. 2020. SCIBERT: A pretrained language
model for scientific text. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical
Methods in Natural Language Processing and 9th International Joint Conference
on Natural Language Processing, Proceedings of the Conference (2020), 3615–3620.
arXiv:arXiv:1903.10676v3
[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long Short-Term Memory-
Networks for Machine Reading. (jan 2016). arXiv:1601.06733 http://arxiv.org/
abs/1601.06733
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding.
(oct 2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
[6] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang,
and Qun Liu. 2019. Tinybert: Distilling bert for natural language understanding.
arXiv (2019). arXiv:arXiv:1909.10351v5
[7] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and
Omer Levy. 2019. SpanBERT: Improving Pre-training by Representing and
Predicting Spans. (jul 2019). arXiv:1907.10529 http://arxiv.org/abs/1907.10529
[8] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush
Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised
Learning of Language Representations. (sep 2019). arXiv:1909.11942 http:
//arxiv.org/abs/1909.11942
[9] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim,
Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A pre-trained biomedical language
representation model for biomedical text mining. Bioinformatics 36, 4 (2020),
1234–1240. https://doi.org/10.1093/bioinformatics/btz682 arXiv:1901.08746
[10] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A
Robustly Optimized BERT Pretraining Approach. (jul 2019). arXiv:1907.11692
http://arxiv.org/abs/1907.11692
[11] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effec-
tive Approaches to Attention-based Neural Machine Translation. (aug 2015).
arXiv:1508.04025 http://arxiv.org/abs/1508.04025
[12] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distil-
BERT, a distilled version of BERT: smaller, faster, cheaper and lighter. (oct 2019).
arXiv:1910.01108 http://arxiv.org/abs/1910.01108
[13] Ta-Chun Su and Hsiang-Chih Cheng. 2019. SesameBERT: Attention for Any-
where. (oct 2019). arXiv:1910.03176 http://arxiv.org/abs/1910.03176
[14] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny
Zhou. 2020. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited
Devices. (apr 2020). arXiv:2004.02984 http://arxiv.org/abs/2004.02984
[15] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. Technical Report.
[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. Technical Report. arXiv:1706.03762v5
[17] Lilian Weng. 2019. Attention? Attention! https://lilianweng.github.io/lil-log/
2018/06/24/attention-attention.html
More Related Content

What's hot

A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...ijaia
Β 
Real interpolation method for transfer function approximation of distributed ...
Real interpolation method for transfer function approximation of distributed ...Real interpolation method for transfer function approximation of distributed ...
Real interpolation method for transfer function approximation of distributed ...TELKOMNIKA JOURNAL
Β 
Duality in nonlinear fractional programming problem using fuzzy programming a...
Duality in nonlinear fractional programming problem using fuzzy programming a...Duality in nonlinear fractional programming problem using fuzzy programming a...
Duality in nonlinear fractional programming problem using fuzzy programming a...ijscmcj
Β 
llorma_jmlr copy
llorma_jmlr copyllorma_jmlr copy
llorma_jmlr copyGuy Lebanon
Β 
Penalty Function Method For Solving Fuzzy Nonlinear Programming Problem
Penalty Function Method For Solving Fuzzy Nonlinear Programming ProblemPenalty Function Method For Solving Fuzzy Nonlinear Programming Problem
Penalty Function Method For Solving Fuzzy Nonlinear Programming Problempaperpublications3
Β 
Methodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesMethodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesijsc
Β 
MIXED 0βˆ’1 GOAL PROGRAMMING APPROACH TO INTERVAL-VALUED BILEVEL PROGRAMMING PR...
MIXED 0βˆ’1 GOAL PROGRAMMING APPROACH TO INTERVAL-VALUED BILEVEL PROGRAMMING PR...MIXED 0βˆ’1 GOAL PROGRAMMING APPROACH TO INTERVAL-VALUED BILEVEL PROGRAMMING PR...
MIXED 0βˆ’1 GOAL PROGRAMMING APPROACH TO INTERVAL-VALUED BILEVEL PROGRAMMING PR...cscpconf
Β 
Solving Scheduling Problems as the Puzzle Games Using Constraint Programming
Solving Scheduling Problems as the Puzzle Games Using Constraint ProgrammingSolving Scheduling Problems as the Puzzle Games Using Constraint Programming
Solving Scheduling Problems as the Puzzle Games Using Constraint Programmingijpla
Β 
Using Grid Puzzle to Solve Constraint-Based Scheduling Problem
Using Grid Puzzle to Solve Constraint-Based Scheduling ProblemUsing Grid Puzzle to Solve Constraint-Based Scheduling Problem
Using Grid Puzzle to Solve Constraint-Based Scheduling Problemcsandit
Β 
Sequential order vs random order in operators of variable neighborhood descen...
Sequential order vs random order in operators of variable neighborhood descen...Sequential order vs random order in operators of variable neighborhood descen...
Sequential order vs random order in operators of variable neighborhood descen...TELKOMNIKA JOURNAL
Β 
Ds03 algorithms jyoti lakhani
Ds03 algorithms jyoti lakhaniDs03 algorithms jyoti lakhani
Ds03 algorithms jyoti lakhanijyoti_lakhani
Β 
Mca2020 advanced data structure
Mca2020  advanced data structureMca2020  advanced data structure
Mca2020 advanced data structuresmumbahelp
Β 
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement Problem
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement ProblemEvolutionary Algorithmical Approach for VLSI Physical Design- Placement Problem
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement ProblemIDES Editor
Β 
Solving linear equations from an image using ann
Solving linear equations from an image using annSolving linear equations from an image using ann
Solving linear equations from an image using anneSAT Journals
Β 
Unit8: Uncertainty in AI
Unit8: Uncertainty in AIUnit8: Uncertainty in AI
Unit8: Uncertainty in AITekendra Nath Yogi
Β 
Dynamic programming class 16
Dynamic programming class 16Dynamic programming class 16
Dynamic programming class 16Kumar
Β 
Algorithms : Introduction and Analysis
Algorithms : Introduction and AnalysisAlgorithms : Introduction and Analysis
Algorithms : Introduction and AnalysisDhrumil Patel
Β 

What's hot (20)

A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
Β 
Real interpolation method for transfer function approximation of distributed ...
Real interpolation method for transfer function approximation of distributed ...Real interpolation method for transfer function approximation of distributed ...
Real interpolation method for transfer function approximation of distributed ...
Β 
Duality in nonlinear fractional programming problem using fuzzy programming a...
Duality in nonlinear fractional programming problem using fuzzy programming a...Duality in nonlinear fractional programming problem using fuzzy programming a...
Duality in nonlinear fractional programming problem using fuzzy programming a...
Β 
llorma_jmlr copy
llorma_jmlr copyllorma_jmlr copy
llorma_jmlr copy
Β 
Penalty Function Method For Solving Fuzzy Nonlinear Programming Problem
Penalty Function Method For Solving Fuzzy Nonlinear Programming ProblemPenalty Function Method For Solving Fuzzy Nonlinear Programming Problem
Penalty Function Method For Solving Fuzzy Nonlinear Programming Problem
Β 
Methodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniquesMethodological study of opinion mining and sentiment analysis techniques
Methodological study of opinion mining and sentiment analysis techniques
Β 
Dynamic pgmming
Dynamic pgmmingDynamic pgmming
Dynamic pgmming
Β 
MIXED 0βˆ’1 GOAL PROGRAMMING APPROACH TO INTERVAL-VALUED BILEVEL PROGRAMMING PR...
MIXED 0βˆ’1 GOAL PROGRAMMING APPROACH TO INTERVAL-VALUED BILEVEL PROGRAMMING PR...MIXED 0βˆ’1 GOAL PROGRAMMING APPROACH TO INTERVAL-VALUED BILEVEL PROGRAMMING PR...
MIXED 0βˆ’1 GOAL PROGRAMMING APPROACH TO INTERVAL-VALUED BILEVEL PROGRAMMING PR...
Β 
Solving Scheduling Problems as the Puzzle Games Using Constraint Programming
Solving Scheduling Problems as the Puzzle Games Using Constraint ProgrammingSolving Scheduling Problems as the Puzzle Games Using Constraint Programming
Solving Scheduling Problems as the Puzzle Games Using Constraint Programming
Β 
Using Grid Puzzle to Solve Constraint-Based Scheduling Problem
Using Grid Puzzle to Solve Constraint-Based Scheduling ProblemUsing Grid Puzzle to Solve Constraint-Based Scheduling Problem
Using Grid Puzzle to Solve Constraint-Based Scheduling Problem
Β 
Sequential order vs random order in operators of variable neighborhood descen...
Sequential order vs random order in operators of variable neighborhood descen...Sequential order vs random order in operators of variable neighborhood descen...
Sequential order vs random order in operators of variable neighborhood descen...
Β 
Ds03 algorithms jyoti lakhani
Ds03 algorithms jyoti lakhaniDs03 algorithms jyoti lakhani
Ds03 algorithms jyoti lakhani
Β 
Mca2020 advanced data structure
Mca2020  advanced data structureMca2020  advanced data structure
Mca2020 advanced data structure
Β 
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement Problem
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement ProblemEvolutionary Algorithmical Approach for VLSI Physical Design- Placement Problem
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement Problem
Β 
Solving linear equations from an image using ann
Solving linear equations from an image using annSolving linear equations from an image using ann
Solving linear equations from an image using ann
Β 
Bm35359363
Bm35359363Bm35359363
Bm35359363
Β 
Unit8: Uncertainty in AI
Unit8: Uncertainty in AIUnit8: Uncertainty in AI
Unit8: Uncertainty in AI
Β 
Dynamic programming class 16
Dynamic programming class 16Dynamic programming class 16
Dynamic programming class 16
Β 
Algorithms : Introduction and Analysis
Algorithms : Introduction and AnalysisAlgorithms : Introduction and Analysis
Algorithms : Introduction and Analysis
Β 
IUI 2016 Presentation Slide
IUI 2016 Presentation SlideIUI 2016 Presentation Slide
IUI 2016 Presentation Slide
Β 

Similar to From_seq2seq_to_BERT

240318_JW_labseminar[Attention Is All You Need].pptx
240318_JW_labseminar[Attention Is All You Need].pptx240318_JW_labseminar[Attention Is All You Need].pptx
240318_JW_labseminar[Attention Is All You Need].pptxthanhdowork
Β 
IRJET- Comparison for Max-Flow Min-Cut Algorithms for Optimal Assignment Problem
IRJET- Comparison for Max-Flow Min-Cut Algorithms for Optimal Assignment ProblemIRJET- Comparison for Max-Flow Min-Cut Algorithms for Optimal Assignment Problem
IRJET- Comparison for Max-Flow Min-Cut Algorithms for Optimal Assignment ProblemIRJET Journal
Β 
Quality Prediction in Fingerprint Compression
Quality Prediction in Fingerprint CompressionQuality Prediction in Fingerprint Compression
Quality Prediction in Fingerprint CompressionIJTET Journal
Β 
Sign Detection from Hearing Impaired
Sign Detection from Hearing ImpairedSign Detection from Hearing Impaired
Sign Detection from Hearing ImpairedIRJET Journal
Β 
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...gerogepatton
Β 
IRJET- K-SVD: Dictionary Developing Algorithms for Sparse Representation ...
IRJET-  	  K-SVD: Dictionary Developing Algorithms for Sparse Representation ...IRJET-  	  K-SVD: Dictionary Developing Algorithms for Sparse Representation ...
IRJET- K-SVD: Dictionary Developing Algorithms for Sparse Representation ...IRJET Journal
Β 
A study on the techniques for speech to speech translation
A study on the techniques for speech to speech translationA study on the techniques for speech to speech translation
A study on the techniques for speech to speech translationIRJET Journal
Β 
Medical diagnosis classification
Medical diagnosis classificationMedical diagnosis classification
Medical diagnosis classificationcsandit
Β 
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...cscpconf
Β 
TEXT GENERATION WITH GAN NETWORKS USING FEEDBACK SCORE
TEXT GENERATION WITH GAN NETWORKS USING FEEDBACK SCORETEXT GENERATION WITH GAN NETWORKS USING FEEDBACK SCORE
TEXT GENERATION WITH GAN NETWORKS USING FEEDBACK SCOREIJCI JOURNAL
Β 
X-RECOSA: MULTI-SCALE CONTEXT AGGREGATION FOR MULTI-TURN DIALOGUE GENERATION
X-RECOSA: MULTI-SCALE CONTEXT AGGREGATION FOR MULTI-TURN DIALOGUE GENERATIONX-RECOSA: MULTI-SCALE CONTEXT AGGREGATION FOR MULTI-TURN DIALOGUE GENERATION
X-RECOSA: MULTI-SCALE CONTEXT AGGREGATION FOR MULTI-TURN DIALOGUE GENERATIONIJCI JOURNAL
Β 
Particle Swarm Optimization in the fine-tuning of Fuzzy Software Cost Estimat...
Particle Swarm Optimization in the fine-tuning of Fuzzy Software Cost Estimat...Particle Swarm Optimization in the fine-tuning of Fuzzy Software Cost Estimat...
Particle Swarm Optimization in the fine-tuning of Fuzzy Software Cost Estimat...Waqas Tariq
Β 
An Efficient Multiplierless Transform algorithm for Video Coding
An Efficient Multiplierless Transform algorithm for Video CodingAn Efficient Multiplierless Transform algorithm for Video Coding
An Efficient Multiplierless Transform algorithm for Video CodingCSCJournals
Β 
AMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTAMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTIRJET Journal
Β 
Exploring the Impact of Magnitude- and Direction-based Loss Function on the P...
Exploring the Impact of Magnitude- and Direction-based Loss Function on the P...Exploring the Impact of Magnitude- and Direction-based Loss Function on the P...
Exploring the Impact of Magnitude- and Direction-based Loss Function on the P...Dr. Amarjeet Singh
Β 
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ijsc
Β 
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ijsc
Β 
EXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATION
EXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATIONEXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATION
EXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATIONijaia
Β 
Boosting auxiliary task guidance: a probabilistic approach
Boosting auxiliary task guidance: a probabilistic approachBoosting auxiliary task guidance: a probabilistic approach
Boosting auxiliary task guidance: a probabilistic approachIAESIJAI
Β 
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ijsc
Β 

Similar to From_seq2seq_to_BERT (20)

240318_JW_labseminar[Attention Is All You Need].pptx
240318_JW_labseminar[Attention Is All You Need].pptx240318_JW_labseminar[Attention Is All You Need].pptx
240318_JW_labseminar[Attention Is All You Need].pptx
Β 
IRJET- Comparison for Max-Flow Min-Cut Algorithms for Optimal Assignment Problem
IRJET- Comparison for Max-Flow Min-Cut Algorithms for Optimal Assignment ProblemIRJET- Comparison for Max-Flow Min-Cut Algorithms for Optimal Assignment Problem
IRJET- Comparison for Max-Flow Min-Cut Algorithms for Optimal Assignment Problem
Β 
Quality Prediction in Fingerprint Compression
Quality Prediction in Fingerprint CompressionQuality Prediction in Fingerprint Compression
Quality Prediction in Fingerprint Compression
Β 
Sign Detection from Hearing Impaired
Sign Detection from Hearing ImpairedSign Detection from Hearing Impaired
Sign Detection from Hearing Impaired
Β 
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
Β 
IRJET- K-SVD: Dictionary Developing Algorithms for Sparse Representation ...
IRJET-  	  K-SVD: Dictionary Developing Algorithms for Sparse Representation ...IRJET-  	  K-SVD: Dictionary Developing Algorithms for Sparse Representation ...
IRJET- K-SVD: Dictionary Developing Algorithms for Sparse Representation ...
Β 
A study on the techniques for speech to speech translation
A study on the techniques for speech to speech translationA study on the techniques for speech to speech translation
A study on the techniques for speech to speech translation
Β 
Medical diagnosis classification
Medical diagnosis classificationMedical diagnosis classification
Medical diagnosis classification
Β 
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
Β 
TEXT GENERATION WITH GAN NETWORKS USING FEEDBACK SCORE
TEXT GENERATION WITH GAN NETWORKS USING FEEDBACK SCORETEXT GENERATION WITH GAN NETWORKS USING FEEDBACK SCORE
TEXT GENERATION WITH GAN NETWORKS USING FEEDBACK SCORE
Β 
X-RECOSA: MULTI-SCALE CONTEXT AGGREGATION FOR MULTI-TURN DIALOGUE GENERATION
X-RECOSA: MULTI-SCALE CONTEXT AGGREGATION FOR MULTI-TURN DIALOGUE GENERATIONX-RECOSA: MULTI-SCALE CONTEXT AGGREGATION FOR MULTI-TURN DIALOGUE GENERATION
X-RECOSA: MULTI-SCALE CONTEXT AGGREGATION FOR MULTI-TURN DIALOGUE GENERATION
Β 
Particle Swarm Optimization in the fine-tuning of Fuzzy Software Cost Estimat...
Particle Swarm Optimization in the fine-tuning of Fuzzy Software Cost Estimat...Particle Swarm Optimization in the fine-tuning of Fuzzy Software Cost Estimat...
Particle Swarm Optimization in the fine-tuning of Fuzzy Software Cost Estimat...
Β 
An Efficient Multiplierless Transform algorithm for Video Coding
An Efficient Multiplierless Transform algorithm for Video CodingAn Efficient Multiplierless Transform algorithm for Video Coding
An Efficient Multiplierless Transform algorithm for Video Coding
Β 
AMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTAMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLT
Β 
Exploring the Impact of Magnitude- and Direction-based Loss Function on the P...
Exploring the Impact of Magnitude- and Direction-based Loss Function on the P...Exploring the Impact of Magnitude- and Direction-based Loss Function on the P...
Exploring the Impact of Magnitude- and Direction-based Loss Function on the P...
Β 
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
Β 
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
Β 
EXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATION
EXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATIONEXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATION
EXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATION
Β 
Boosting auxiliary task guidance: a probabilistic approach
Boosting auxiliary task guidance: a probabilistic approachBoosting auxiliary task guidance: a probabilistic approach
Boosting auxiliary task guidance: a probabilistic approach
Β 
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
Β 

Recently uploaded

Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
Β 
pipeline in computer architecture design
pipeline in computer architecture  designpipeline in computer architecture  design
pipeline in computer architecture designssuser87fa0c1
Β 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
Β 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
Β 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
Β 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
Β 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
Β 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
Β 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
Β 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
Β 
Study on Air-Water & Water-Water Heat Exchange in a Finned ο»ΏTube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned ο»ΏTube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned ο»ΏTube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned ο»ΏTube ExchangerAnamika Sarkar
Β 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
Β 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
Β 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
Β 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
Β 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
Β 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
Β 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
Β 

Recently uploaded (20)

Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
Β 
pipeline in computer architecture design
pipeline in computer architecture  designpipeline in computer architecture  design
pipeline in computer architecture design
Β 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
Β 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
Β 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
Β 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Β 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
Β 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
Β 
young call girls in Green ParkπŸ” 9953056974 πŸ” escort Service
young call girls in Green ParkπŸ” 9953056974 πŸ” escort Serviceyoung call girls in Green ParkπŸ” 9953056974 πŸ” escort Service
young call girls in Green ParkπŸ” 9953056974 πŸ” escort Service
Β 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
Β 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
Β 
Study on Air-Water & Water-Water Heat Exchange in a Finned ο»ΏTube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned ο»ΏTube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned ο»ΏTube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned ο»ΏTube Exchanger
Β 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Β 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
Β 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
Β 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
Β 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
Β 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
Β 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
Β 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
Β 

From_seq2seq_to_BERT

  • 1. From Seq2Seq to BERT Huali Z ABSTRACT In the end of 2018, BERT[5] caused a stir in machine learning with 7.7% point improvement on GLUE(General Langugage Understand- ing Evaluation) score and 5.1 point improvement on SQuAD(Stanford Question Answering Dataset) v2.0. Increasing numbers of pre- training models have been published based on BERT. Records of model scores on SQuAD, GLUE and MultiNLI(Multi-Genre Natural Language Inference) has been refreshed every year. The nitty-gritty of BERT is Transformer[16] and bi-directional training. For Trans- former, Attention is all you need [16]. It is only 6 years of time since Seq2Seq[15] model was published in 2014. How the models evolved to consume less computer power, to enable parallel computing and to get better scores? This paper illustrates a short history of the NLP models evolution step by step. From Seq2Seq model to BERT, from intermediate fixed size vector to Transformer. It also connects the dots of lingoes like Seq2Seq, fixed size vector, attention, soft alignment, attention, Transformer, pre-training, fine-tuning, and BERT. 1 INTRODUCTION Staring with Seq2Seq which solved the problem that Deep Neural Networks can only be applied to problem who’s input and targets can be sensibly encoded with vectors of fixed dimensionality[15], this paper states a short evolution history of NLP models from 2014 to 2020 when variants of BERT models are published. Seq2Seq model proposed a structure with a fixed size vector be- tween input and output layer. This structure fixed the dimension- ality problem of deep neural networks by the intermediate vector. When it comes to long sentences, this fixed size vector can be a bottleneck for performance. Attention model introduced a encoder and decoder structure with soft alignment. After the attention model, different types of atten- tion models are innovated. Soft attention works well on getting context information, however it is expensive. Hard attention is less expensive but requires more complicated techniques. Global attention takes into all hidden states, where as local attention only takes in a subset of hidden states. Self attention is using positions of a sequence to produce encoding a word in the same sequence. Attention can also be categorized by attention score calculation methods. Dot production attention and concat attention are most used methods. Vaswani’s research has shown that dot production attention is much faster and more space efficient. RNNs in attention models can be an obstacle for parallelism and resource efficiency. Transformer solves the problem by relying on self attention completely without RNNs. Along with self attention, Transformer also combines the techniques like scaled-dot attention, multi-head attention, position encoding and masking to achieve the best result. Based on Transformer, BERT was innovated for pre-training. The ground breaking feature of BERT is bidirectional training of Trans- former. It allows BERT to understand context better and enables information flow between layers. Variants of BERT has been pub- lished after BERT standing under the spotlight. Models like Span- BERT and SesameBERT improves the attention span to get better understanding of context. ALBERT, DistilBERT, TinyBERT and Mo- bileBERT improves cost efficiency while keeps same performance. SciBERT and BioBERT gives pre-trained model for domain specific NLP tasks. 2 SEQ2SEQ Deep neural networks has shown its power on parallel computation and back-propagation. 
At the same time, it has limitation on the dimensionality on input and outputs. The input and output has to be encoded with fixed size vectors. In real life, for many problems the best expression of input and output vector sizes are not known ahead of the time. For example, sequence problem like voice recog- nition. Seq2Seq model[15] was created for language modeling. It is nor- mally used for translation, summary extraction and question answer bots . The model contains two parts: encoder and decoder. Encoder maps the input sequence into a fixed size vector and decoder out- puts a sequence from the vector. Both the encoder and decoder are multi-layered LSTM. The goal for the LSTM is to estimates the conditional probability 𝑝(𝑦1, ...,𝑦𝑇′ |π‘₯1, ...,π‘₯𝑇 ) [15] where the input sequence length T is different from the output sequence length 𝑇 β€² . The mathematical representation of this model is: 𝑝(𝑦1, ...,𝑦𝑇′ |π‘₯1, ...,π‘₯𝑇 ) = 𝑇′ Γ– 𝑖=1 𝑝(𝑦𝑑 |𝑣,𝑦1, ...,π‘¦π‘‘βˆ’1) (1) In the equation, v is a fixed sized vector which produced by the last hidden state of input LSTM. This fixed size vector is being used as the initial input of the output LSTM. Each p(y_t | v, y_1, ... , y_t-1) is represented with a softmax all over the words in the volcabulary. [15] There are 3 features which are not included in the math representa- tion. (1) Two different LSTM is used for input sequence and output se- quence respectively. It helps on the parallelism while training the model with different languages with small computational cost. (2) A deep LSTM with 4 layers is used because deep LSTM works better than shallow LSTM in this case. (3) Reversing the words of input sequence introduced short term dependencies which improves the model’s performance on long sentences. 3 ATTENTION The fixed size vector in Seq2Seq model enables modeling without knowing input and output sequence size. However, when it comes to long sentences, it is a bottleneck for improving performance. Because all information is compressed into the fixed size vector, the beginning part of the information is forgotten. Bahdanau et all[2]
3 ATTENTION
The fixed-size vector in the Seq2Seq model makes it possible to model sequences without knowing the input and output lengths in advance. However, for long sentences it becomes a performance bottleneck: because all information is compressed into one fixed-size vector, the beginning of the sequence tends to be forgotten. Bahdanau et al.[2] proposed a new mechanism that drops the fixed-size vector and introduces a new encoder-decoder architecture.

Figure 1: Attention architecture [2]

As Figure 1 illustrates, the encoder converts the input sequence x into annotations h with a bi-directional RNN. These annotations are weighted and summed into a context vector c. Using the context vector c, along with the previous hidden state $s_{i-1}$ and the previous output $y_{i-1}$, the decoder computes the output $y_i$. The bi-directional RNN lets the encoder calculate hidden states in both directions, so for each word $x_j$ the annotation is a concatenation $h_j = [\overrightarrow{h_j}^{T}; \overleftarrow{h_j}^{T}]$. The model can be written as:

$p(y_i \mid y_1, \dots, y_{i-1}, X) = g(y_{i-1}, s_i, c_i)$   (2)

where:

$s_i = f(s_{i-1}, y_{i-1}, c_i)$   (3)

$c_i = \sum_{j=1}^{T_x} a_{ij} h_j$   (4)

$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$   (5)

$e_{ij} = a(s_{i-1}, h_j)$   (6)

$s_i$ in equation 3 is the hidden state of the RNN at time i. $c_i$ in equation 4 is the context vector, a weighted sum of the annotations $h_j$. The annotation sequence $(h_1, \dots, h_{T_x})$ is mapped from the input sequence; each annotation $h_j$ contains information about the whole input, with a strong focus on the words around the j-th word. The weight $a_{ij}$ in equation 5 is calculated by the alignment model in equation 6. The alignment model scores how well the inputs around position j and the output at position i match, and it is trained and optimized jointly with the whole translation model[2].

From the equations we can see that the next output $y_i$ and hidden state $s_i$ depend on the previous hidden state $s_{i-1}$ and the annotations $h_j$. This design implements the attention mechanism in the decoder. With attention, the fixed-size vector is no longer needed; instead, a soft alignment between the input and output is learnt. This improves the model's performance on long sentences significantly.
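As a rough illustration of equations (4)-(6), the numpy sketch below computes additive alignment scores, softmax weights and the resulting context vector for one decoder step. The parameter names, matrices and dimensions are made-up placeholders rather than the parameters learned in [2].

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention_step(s_prev, H, W_s, W_h, v_a):
    """One decoder step of Bahdanau-style attention (illustrative shapes).

    s_prev : (d_s,)      previous decoder hidden state s_{i-1}
    H      : (T_x, d_h)  encoder annotations h_1 .. h_{T_x}
    W_s, W_h, v_a        parameters of the alignment model a(.)
    """
    # e_ij = v_a^T tanh(W_s s_{i-1} + W_h h_j)   -> equation (6)
    e = np.tanh(s_prev @ W_s + H @ W_h) @ v_a    # shape (T_x,)
    a = softmax(e)                               # equation (5): normalized weights
    c = a @ H                                    # equation (4): weighted sum of annotations
    return c, a

rng = np.random.default_rng(0)
T_x, d_h, d_s, d_a = 6, 8, 8, 16
c, a = additive_attention_step(rng.normal(size=d_s),
                               rng.normal(size=(T_x, d_h)),
                               rng.normal(size=(d_s, d_a)),
                               rng.normal(size=(d_h, d_a)),
                               rng.normal(size=d_a))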
3.1 Attention Family
Depending on how the alignment score is calculated, attention mechanisms can be categorized into dot-product[11], scaled dot-product[16], concat[2], general and others. In broader categories, there are global attention[11], local attention[11], hard attention[11], soft attention[2] and self attention[4].

3.1.1 hard vs soft. In a soft attention model, the alignment vector is learnt by an alignment model (see equation 6). The alignment model directly computes a soft alignment, which allows the gradient of the cost function to be back-propagated through it. This gradient is used to train the alignment model jointly with the whole translation model[2]. For example, in an English-to-French translation task, soft alignment allows the model to translate the word 'the' to 'le' or 'la' dynamically based on the context. The drawback of this model is that the computation becomes expensive when the input sequence is long. In hard attention, the alignment vector $a_t$ is learnt from a selected part of the input sequence. It is less expensive than soft attention, but it is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train[11].

3.1.2 global vs local. The idea of the global attention model is to consider all the hidden states of the encoder when deriving the context vector $c_t$ [11].

Figure 2: Global attention [11]

As Figure 2 shows, all the hidden states are used to calculate the alignment vector $a_t$ at time t, and the global context vector $c_t$ is computed as a weighted average of the hidden states using $a_t$. Using all the hidden states can be expensive and can potentially make it impractical to translate longer sequences, e.g. paragraphs or documents[11].

Figure 3: Local attention [11]

Local attention instead pays attention to only a subset of the hidden states to get the weight vector $a_t$. Figure 3 illustrates that the alignment vector $a_t$ is obtained from part of the input hidden states, namely those within a window $[p_t - D, p_t + D]$, where $p_t$ is the aligned position and D is a chosen value. The context vector $c_t$ is then a weighted average over these hidden states. Unlike in the global attention model, $a_t$ now has a fixed size. To avoid the non-differentiable alignment problem of hard attention, the local attention model predicts the aligned position $p_t$ using equation 7, where S is the source sentence length and the model parameters $W_p$ and $v_p$ are learnt for predicting $p_t$:

$p_t = S \cdot \mathrm{sigmoid}(v_p^T \tanh(W_p h_t))$   (7)
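The following numpy sketch shows the local attention step described above: predicting the aligned position $p_t$ (equation 7) and attending only to the window $[p_t - D, p_t + D]$. Parameter names and sizes are illustrative assumptions; a simple dot-product score is used inside the window, and the Gaussian re-weighting around $p_t$ used in [11] is omitted for brevity.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention_step(h_t, H_src, W_p, v_p, D=2):
    """Predict p_t (equation 7) and attend to the window [p_t - D, p_t + D]."""
    S = H_src.shape[0]                                   # source sentence length
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))          # equation (7)
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = H_src[lo:hi]                                # subset of encoder hidden states
    scores = window @ h_t                                # dot-product alignment scores
    a_t = softmax(scores)                                # fixed-size weight vector (<= 2D+1)
    c_t = a_t @ window                                   # context vector from the window only
    return c_t, a_t, p_t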
3.1.3 self attention. Self attention is also called intra-attention[4]. It is a mechanism that uses attended positions of a sequence to produce the encoding of a word in the same sequence. Figure 4 shows how much attention is paid in a machine reading task using self attention. The model processes text incrementally while learning which past tokens in the memory relate to the current token being processed, and to what extent[4].

Figure 4: Self attention in a machine reading task [4]

3.1.4 alignment score calculation methods. There are various ways of calculating the alignment score. The score measures how well a source sequence hidden state and a target sequence hidden state are aligned; equation 6 is one example of an alignment score function. Bahdanau's research employed a feed-forward network, the function a in equation 6, to derive the alignment score. This way of getting the alignment score is referred to as concat in Luong's research[11] and additive attention in Vaswani's paper[16]. Besides this method, Luong's research also explored 'dot product' $\mathrm{score}(h_t, \bar{h}_s) = h_t^T \bar{h}_s$ [11], 'general' $\mathrm{score}(h_t, \bar{h}_s) = h_t^T W_a \bar{h}_s$ [11] and the 'location-based' function $a_t = \mathrm{softmax}(W_a h_t)$ [11]. The research shows that the concat method is no better than the others. Dot-product attention and concat attention are the most used methods for getting alignment scores. Dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code[16]. Vaswani's research introduced a method called 'scaled dot product', $\mathrm{score}(h_t, \bar{h}_s) = \frac{h_t^T \bar{h}_s}{\sqrt{n}}$, which is similar to dot product but divides by the square root of the dimension n. This scaling factor helps the model deal with large dot-product values when the dimension is large.

4 TRANSFORMER
Using RNNs for sequence modeling was without doubt the state-of-the-art approach before the Transformer architecture. RNNs capture coherence in the sequence through their hidden states. However, this inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples[16]. Vaswani's work proposed the Transformer model, which relies entirely on the self attention mechanism, without RNNs.

4.1 Transformer components
4.1.1 Transformer's self scaled-dot attention. The attention mechanism first creates three vectors for each word embedding: Q (query), K (key) and V (value). These vectors are created by multiplying the input embedding with three matrices $W_q$, $W_k$ and $W_v$. It then calculates a score for each word in the input sentence, divides the scores by the square root of the K dimension, applies a softmax to the divided scores, and finally multiplies the V vectors by the softmaxed scores. The whole process is shown in Equation 8[16] and Figure 5[16]. The scaled dot function helps to obtain more stable gradients, and the softmax ensures the scores at each input position add up to 1.

Figure 5: Scaled dot attention [16]

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$   (8)

4.1.2 Multi-head attention. As Figure 6 shows, multi-head attention linearly projects the keys, values and queries h times. An attention layer is applied to each projected version of K, V and Q, producing output values in parallel. These outputs are concatenated and linearly projected to get the final value. With multi-head attention, the model increases its ability to focus on different parts of the input sequence by jointly combining information from different representation subspaces. The mechanism can be written as equation 9[16]:

Figure 6: Multi-head attention [16]

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$   (9)

where

$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$   (10)
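The numpy sketch below implements equations (8)-(10) directly for the self-attention case (Q = K = V = the input embeddings). The head count and dimensions are arbitrary illustrative values, and the projection matrices are random placeholders rather than learned parameters.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, equation (8)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (T_q, T_k)
    return softmax(scores) @ V        # (T_q, d_v)

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    """Multi-head attention, equations (9)-(10); Wq/Wk/Wv are lists of per-head projections."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo   # concatenate, then final linear projection

rng = np.random.default_rng(0)
T, d_model, h, d_k = 5, 16, 4, 4                 # d_k = d_model / h
X = rng.normal(size=(T, d_model))                # token embeddings (plus positions)
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
out = multi_head(X, X, X, Wq, Wk, Wv, Wo)        # self attention: Q = K = V = X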
4.1.3 position encoding. Removing the RNNs from the model discards the hidden states that carried the inherent position information. To bring position information back, the Transformer injects positional encodings into the input embeddings of both the encoder and the decoder, so that absolute or relative position information is carried into the projected K, V and Q vectors.

4.1.4 masking. The self attention layer in the decoder differs from the one in the encoder: masking is applied in the decoder's self attention layer to ensure that each prediction only attends to words at earlier positions.

4.2 Transformer architecture

Figure 7: Transformer [16]

The architecture of the Transformer is illustrated in Figure 7. Like the Seq2Seq model and other attention models, the Transformer has an encoder-decoder structure. Both the encoder and the decoder are composed of 6 identical layers, 6 being an empirically chosen number. Each encoder layer contains two sub-layers: a multi-head self attention layer and a feed-forward network layer. After positional encoding, the input sequence flows into the multi-head attention layer, so the model can encode a word together with its surrounding words. The output of the multi-head attention layer is fed into a feed-forward neural network that is applied to each position. The decoder has a similar structure, but with one additional multi-head attention layer that takes in the output of the encoder. The decoder output passes through a linear layer and a softmax layer so that the output vector can be mapped to a word.

5 BERT

Figure 8: BERT framework [5]

BERT, which stands for Bidirectional Encoder Representations from Transformers[5], is a pre-trained Transformer encoder that improves fine-tuning through bidirectional training. When applying a pre-trained representation to a specific downstream task, there are two main strategies: feature-based and fine-tuning. The feature-based strategy uses the pre-trained representations as additional features; fine-tuning simply tunes the task-specific parameters. The pre-training models before BERT were unidirectional, which restricts how well they can understand context. BERT introduced a masked language model (MLM) along with "next sentence prediction" to enable bidirectional representations. The masked language model randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of the masked word based only on its context[5].
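As a rough illustration of the masking step, the sketch below applies the corruption scheme reported in [5] to a token sequence: 15% of tokens are selected, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The toy vocabulary and function name are placeholders, not part of any BERT implementation.

import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "[MASK]"]   # toy vocabulary (placeholder)

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    """Return corrupted tokens plus labels for the positions the model must predict."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                       # the training target is the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with the [MASK] token
        elif r < 0.9:
            corrupted[i] = rng.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

corrupted, labels = mask_for_mlm(["the", "cat", "sat", "on", "the", "mat"])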
5.1 Optimized BERT
5.1.1 Improved training procedure. After evaluating BERT's parameters across datasets, Liu et al. found that BERT was significantly undertrained. RoBERTa (Robustly optimized BERT approach)[10] was created to provide a better way of training BERT models. The approach includes (1) training the model longer, with bigger batches, over more data; (2) removing the next sentence prediction objective; (3) training on longer sequences; and (4) dynamically changing the masking pattern applied to the training data[10]. With these changes, RoBERTa improves on the performance of $BERT_{LARGE}$.

5.1.2 Improved attention span. SpanBERT[7] extends BERT with randomly masked spans and a span boundary objective (SBO). Compared with the masked tokens in vanilla BERT, masked spans provide a better context representation. The SBO trains the model to predict the entire masked span from the observed tokens at its boundary, without relying on the individual token representations within it[7]. SesameBERT[13] enables information flow across BERT's stacked layers through Squeeze and Excitation, and enriches local information by capturing neighboring context via Gaussian blurring[13].

5.1.3 Improved cost efficiency. As pre-training model sizes grow for better fine-tuning performance, the memory limitations of available hardware become an obstacle to scaling pre-trained models further. Existing model parallelization and clever memory management can mitigate the memory shortage but not the communication overhead. ALBERT (A Lite BERT)[8] addresses these problems: it provides a BERT model with fewer parameters, so it consumes less memory and runs faster. Two parameter-reduction strategies are used in ALBERT: a factorized embedding parameterization and cross-layer parameter sharing[8]. To alleviate the resource limitation, DistilBERT[12] introduces a triple loss combining language modeling, distillation and cosine-distance losses[12]. It reduces the BERT model size by 40% and speeds up inference by 60% while keeping 97% of the model's language understanding ability. Similar compact BERT models are MobileBERT[14] and TinyBERT[6]. MobileBERT's goal is a lightweight pre-trained model for different downstream tasks. TinyBERT is 7.5 times smaller and 9.4 times faster, using a novel Transformer distillation method and a two-stage learning framework.
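To give a feel for the factorized embedding parameterization mentioned above, the short calculation below compares a full V x H embedding table with a V x E + E x H factorization. The vocabulary and layer sizes are illustrative round numbers chosen here, not the exact configuration reported in [8].

# Illustrative parameter counts for a factorized embedding (sizes are assumptions).
V = 30_000       # vocabulary size
H = 1_024        # hidden size of the Transformer layers
E = 128          # lower-dimensional embedding size used by the factorization

full_embedding       = V * H            # 30,720,000 parameters
factorized_embedding = V * E + E * H    #  3,971,072 parameters

print(f"full: {full_embedding:,}  factorized: {factorized_embedding:,}")
print(f"reduction: {full_embedding / factorized_embedding:.1f}x fewer embedding parameters")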
5.1.4 Domain oriented improvement. On specific domains, the lack of high-quality, large-scale data limits BERT's performance. Domain-specific pre-trained BERT models have been published to provide generic yet domain-specific language representation models. For example, SciBERT[3] and BioBERT[9] are trained on a large multi-domain corpus of scientific publications and on biomedical corpora, respectively.

6 CONCLUSION
A rapidly growing body of research has been built on variants of BERT in only two years. While reading these papers, understanding concepts like attention helps to see what each model is based on. This paper introduces a short history of NLP models: Seq2Seq, attention, Transformer and BERT. All of these models aim at better performance while becoming less expensive in time and memory.

REFERENCES
[1] Jay Alammar. 2019. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. (2015). arXiv:1409.0473v7
[3] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of EMNLP-IJCNLP 2019, 3615-3620. arXiv:1903.10676v3
[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long Short-Term Memory-Networks for Machine Reading. (2016). arXiv:1601.06733 http://arxiv.org/abs/1601.06733
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
[6] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for Natural Language Understanding. (2019). arXiv:1909.10351v5
[7] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving Pre-training by Representing and Predicting Spans. (2019). arXiv:1907.10529 http://arxiv.org/abs/1907.10529
[8] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. (2019). arXiv:1909.11942 http://arxiv.org/abs/1909.11942
[9] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics 36, 4 (2020), 1234-1240. https://doi.org/10.1093/bioinformatics/btz682 arXiv:1901.08746
[10] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692
[11] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. (2015). arXiv:1508.04025 http://arxiv.org/abs/1508.04025
[12] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. (2019). arXiv:1910.01108 http://arxiv.org/abs/1910.01108
[13] Ta-Chun Su and Hsiang-Chih Cheng. 2019. SesameBERT: Attention for Anywhere. (2019). arXiv:1910.03176 http://arxiv.org/abs/1910.03176
[14] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices. (2020). arXiv:2004.02984 http://arxiv.org/abs/2004.02984
[15] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. (2014).
[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. (2017). arXiv:1706.03762v5
[17] Lilian Weng. 2018. Attention? Attention! https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html