Monotonic Multihead Attention
Presented by: June-Woo Kim
ABR lab, Department of Artificial Intelligence,
Kyungpook National University
2, Dec. 2020.
Xutai Ma et al.
Facebook, Johns Hopkins Univ.
ICLR 2020
Presentation contents
• Abstract
• Background
• Problem definition
• Proposed architecture
• Experiments and results
• Reference
• Appendix
Abstract
• This paper extends previous models for monotonic attention to the multi-head attention used in Transformers [1],
yielding “Monotonic Multi-head Attention”.
• The proposed method is a relatively straightforward extension of the previous Hard [2] and Infinite Lookback [3]
monotonic attention models.
• This paper achieves better latency-quality tradeoffs in simultaneous Machine Translation tasks in two language pairs.
• Also, this paper is a meaningful contribution to the task of simultaneous machine translation.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
[2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
[3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Background: Simultaneous Translation
• Start translating before reading the whole source sentence.
• Applications
• Video Subtitle Translation.
• International Conferences.
• Personal Translation Assistant.
• Goal: We do not want to wait for the completion of the full sentence before we start translation!
Background: Simultaneous Translation
• Same grammatical structure → simple!
• English:
• A republican strategy to counter the re-election of Obama. (SVO)
• Spanish:
• Una estrategia republicana para contrarrestar la reelección de Obama. (SVO)
• S: subject, O: object, V: verb
Action | English                   | Spanish
Read   | A Republican strategy     |
Write  |                           | Una estrategia republicana
Read   | to counter                |
Write  |                           | para contrarrestar
Read   | the re-election of Obama  |
Write  |                           | la reelección de Obama
Background: Simultaneous Translation
• Different grammatical structure → more complex
• English:
• A republican strategy to counter the re-election of Obama. (SVO)
• Korean:
• Obama 재선을 대응하기 위한 공화당 전략. (SOV)
• S: subject, O: object, V: verb
Action | English                   | Korean
Read   | A Republican strategy     |
Write  |                           | 공화당 전략
Read   | to counter                |
Read   | the re-election of Obama  |
Write  |                           | 대응하다
Write  |                           | Obama 재선
Background on the problem: Current Approaches
1. Fixed Policy – Weaker performance
• Wait If-* [4] policy. (Rule-based)
• Wait-k [5] policy. (Start translating after reading the first 𝑘 source tokens)
• Incremental decoding [6]. (Incremental learning)
2. Reinforcement Learning – Less stable learning process.
• Markov chain [7] (Markov chain with RL)
• Make decisions on when to translate from the interaction [8]. (Used pre-trained offline model to teach agent)
• Continuous rewards policy [9]. (Rewards policy gradient for online alignments)
[4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016).
[5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. 2019.
[6] Dalvi, Fahim, et al. "Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation." Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
[7] Grissom II, Alvin, et al. "Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation." Proceedings of the 2014 Conference on empirical methods in natural language processing
(EMNLP). 2014.
[8] Gu, Jiatao, et al. "Learning to translate in real-time with neural machine translation." arXiv preprint arXiv:1610.00388 (2016).
[9] Luo, Yuping, et al. "Learning online alignments with continuous rewards policy gradient." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
Background on the problem: Current Approaches
3. Monotonic Attention (MA) – State of the art.
• Hard Monotonic Attention [2]. (First introduction of concept of MA)
• Monotonic Chunkwise Attention (MoChA) [10]. (Let the model attend to a chunk of encoder states)
• Monotonic Infinite Lookback Attention (MILk) [3]. (Introduced infinite lookback to improve the quality)
• Let’s take a closer look at this mechanism!
[2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
[10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018.
[3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Background: Sequence-to-sequence
• Recently, the “Sequence-to-sequence” (Seq2Seq) [11] framework has facilitated the use of RNNs on sequence
transduction problems such as machine translation and speech recognition.
• Encoder: the input sequence is processed by some network. (e.g., RNNs, CNNs, FC layers, hybrids, etc.)
• Decoder: produces the target sequence from the output of the encoder. (usually an RNN)
• This often results in the model having difficulty generalizing to longer sequences than those seen during training.
[11] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.
Figure reference: https://brunch.co.kr/@kakao-it/155
Background: Attention mechanisms in Seq2seq
• Attention mechanism in Seq2seq
• Encoder: produces a sequence of hidden states, each corresponding to an entry in the input sequence.
• Decoder: is allowed to refer back to any of the encoder states as it produces its output.
• Several effective attention mechanisms have been used with Seq2seq:
• Bahdanau attention [12]. (additive attention)
• Luong attention [13]. (dot-product attention)
• Multi-head attention [1]. (scaled dot-product attention with multiple heads; each head performs scaled dot-product attention on its own projection of the queries, keys, and values)
[12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
[13] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Background: Soft attention
• The encoder RNN processes the given input sequence 𝑥 = {𝑥_1, … , 𝑥_𝑇} to produce a sequence of hidden states ℎ = {ℎ_1, … , ℎ_𝑇}.
• ℎ is referred to as the “memory” to emphasize its connection to memory-augmented neural networks.
• The decoder RNN produces an output sequence 𝑦 = {𝑦_1, … , 𝑦_𝑈}, conditioned on the memory.
Background: Soft attention (additive attention)
$$e_{i,j} = a(s_{i-1}, h_j)$$
$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T} \exp(e_{i,k})}$$
$$c_i = \sum_{j=1}^{T} \alpha_{i,j} h_j$$
$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
$$y_i = g(s_i, c_i)$$
• When computing 𝑦𝑖, a soft attention-based decoder uses a learnable nonlinear function 𝑎(∙) to produce a scalar value 𝑒𝑖,𝑗
for each entry ℎ𝑗 in the memory based on ℎ𝑗 and the decoder’s state at the previous timestep 𝑠𝑖−1.
• 𝑎(∙) is a single-layer neural network using a 𝑡𝑎𝑛ℎ nonlinearity, but other functions such as a simple dot product between 𝑠𝑖−1 and ℎ𝑗
have been used.
• 𝑐𝑖 is the weighted sum of ℎ.
• Decoder updates its state to 𝑠𝑖 based on 𝑠𝑖−1 and 𝑐𝑖 and produces 𝑦𝑖.
• 𝑓(∙) is an RNN (one or more LSTM or GRU layers) and 𝑔(∙) is a learnable nonlinear function which maps the decoder state to the
output space.
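As a concrete illustration, here is a minimal NumPy sketch of one decoder step of the additive soft attention described above; the dimensions, initialization, and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def additive_energies(s_prev, h, W_s, W_h, v):
    # e_{i,j} = v^T tanh(W_s s_{i-1} + W_h h_j) for every memory entry h_j
    return np.tanh(s_prev @ W_s + h @ W_h) @ v            # shape: (T,)

def soft_attention_step(s_prev, h, W_s, W_h, v):
    e = additive_energies(s_prev, h, W_s, W_h, v)          # one scalar energy per h_j
    alpha = softmax(e)                                     # weights over ALL T memory entries
    c = alpha @ h                                          # context c_i: weighted sum of the memory
    return c, alpha

# Toy usage: T = 5 encoder states of size 8, decoder state of size 8, attention dim 16.
rng = np.random.default_rng(0)
T, d, d_att = 5, 8, 16
h = rng.standard_normal((T, d))
s_prev = rng.standard_normal(d)
W_s, W_h, v = rng.standard_normal((d, d_att)), rng.standard_normal((d, d_att)), rng.standard_normal(d_att)
c, alpha = soft_attention_step(s_prev, h, W_s, W_h, v)
print(alpha.round(3), c.shape)   # weights sum to 1; c has shape (8,)
```

Note that every output step touches all T memory entries, which is exactly the O(TU) cost criticized on the next slide.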
Problem definitions in Attention mechanism in Seq2seq
• Problem
• A common criticism of soft attention is that the model must perform a pass over the entire input sequence when producing
each element of the output sequence.
• This results in the decoding process having complexity 𝑂(𝑇𝑈), where 𝑇 and 𝑈 are the input and output sequence lengths
respectively.
• Furthermore, because the entire sequence must be processed prior to outputting any symbols, soft attention cannot be used
in “online” settings where output sequence elements are produced when the input has only been partially observed.
Background: Monotonic Attention (MA)
• Soft attention mechanisms perform a pass over the entire input sequence when producing each element in the
output sequence.
• Authors [2] proposed an end-to-end differentiable method for learning monotonic alignments which, at test time,
enables computing attention online and in linear time.
[Figure: conventional soft attention (e.g., additive, dot-product) vs. monotonic attention]
[2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
Background: MA
• If for any output timestep 𝑖 we have 𝑧_{𝑖,𝑗} = 0 for all 𝑗 ∈ {𝑡_{𝑖−1}, … , 𝑇} (the mechanism never chooses to stop), 𝑐_𝑖 is simply set to a vector of zeros.
• 𝑒_{𝑖,𝑗} = 𝑎(𝑠_{𝑖−1}, ℎ_𝑗) → 𝑒_{𝑖,𝑗} = MonotonicEnergy(𝑠_{𝑖−1}, ℎ_𝑗)
• 𝑝_{𝑖,𝑗} = 𝜎(𝑒_{𝑖,𝑗})
• 𝑧_{𝑖,𝑗} ~ Bernoulli(𝑝_{𝑖,𝑗})
• where 𝑎(·) is a learnable deterministic “energy function” and 𝜎(·) is the logistic sigmoid function.
• Note that the decision at position 𝑗 only needs the encoder states ℎ_𝑘, 𝑘 ∈ {1, … , 𝑗}, so attention can be computed online as the input arrives.
• The time complexity is 𝑂(max(𝑇, 𝑈)).
• Through this mechanism, MA explicitly processes the input sequence in a left-to-right order and makes a hard assignment of
𝑐_𝑖 to one particular encoder state, denoted ℎ_{𝑡_𝑖}.
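A minimal sketch of this test-time hard monotonic decoding process, in plain Python/NumPy; `monotonic_energy` and `decoder_step` stand in for the learned energy function and decoder cell and are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_monotonic_decode(h, s0, monotonic_energy, decoder_step, max_len=50):
    """Test-time hard monotonic attention: scan the memory left to right and
    stop at the first position whose selection probability exceeds 0.5."""
    T = len(h)
    s, t = s0, 0                       # t: current memory position (never moves backwards)
    outputs = []
    for i in range(max_len):
        c = np.zeros_like(h[0])        # c_i = 0 if the mechanism never chooses to stop
        j = t
        while j < T:
            p = sigmoid(monotonic_energy(s, h[j]))   # p_{i,j}
            if p >= 0.5:               # z_{i,j} = 1: attend to h_j and write
                c, t = h[j], j
                break
            j += 1                     # z_{i,j} = 0: move on to the next source state
        s, y = decoder_step(s, c)      # produce output token y_i
        outputs.append(y)
    return outputs
```

Because the position t never moves backwards, the number of energy evaluations grows only linearly with the sequence lengths.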
Background: MA
• This model cannot be trained with back-propagation because of the sampling step.
• Instead, the model is trained on the expected value of the context vector (inspired by soft alignments), while encouraging 𝑝_{𝑖,𝑗} to become discrete.
• 𝛼_{𝑖,𝑗} defines the probability that input time step 𝑗 is attended at output time step 𝑖:
$$\alpha_{i,j} = P_i(h_j \text{ used}) = P_i(h_j \text{ used} \mid h_j \text{ checked})\, P_i(h_j \text{ checked})$$
$$P_i(h_j \text{ used} \mid h_j \text{ checked}) = p_{i,j}$$
$$P_i(h_j \text{ checked}) = P_i(h_{j-1} \text{ checked},\, h_{j-1} \text{ not used}) + P_{i-1}(h_j \text{ used})$$
Background: MA
• Stepwise probability at output step 𝑖, source position 𝑗, and the corresponding test-time action:
$$p_{i,j} \begin{cases} \geq 0.5 & \text{write the } i\text{-th target token (attend to } h_j\text{)} \\ < 0.5 & \text{read the } (j+1)\text{-th source token} \end{cases}$$
• Expected alignment (monotonic attention), computed in parallel during training:
$$\alpha_{i,:} = p_{i,:} \cdot \mathrm{cumprod}(1 - p_{i,:}) \cdot \mathrm{cumsum}\!\left(\frac{\alpha_{i-1,:}}{\mathrm{cumprod}(1 - p_{i,:})}\right) \quad (1)$$
where $\mathrm{cumprod}(x) = \left[1,\, x_1,\, x_1 x_2,\, \dots,\, \prod_{i=1}^{|x|-1} x_i\right]$ and $\mathrm{cumsum}(x) = \left[x_1,\, x_1 + x_2,\, \dots,\, \sum_{i=1}^{|x|} x_i\right]$.
The resulting process computes at most max(𝑇, 𝑈) terms 𝑝_{𝑖,𝑗}, i.e., it runs in linear time.
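Below is a small NumPy sketch of Equation (1), computing the expected monotonic alignment α_{i,:} from p_{i,:} and α_{i−1,:}; the epsilon guard against division by zero is an implementation assumption.

```python
import numpy as np

def exclusive_cumprod(x):
    # [1, x_1, x_1*x_2, ..., prod(x_1..x_{T-1})]
    return np.concatenate(([1.0], np.cumprod(x)[:-1]))

def expected_alignment_step(p_i, alpha_prev, eps=1e-10):
    """Equation (1): alpha_i = p_i * cumprod(1 - p_i) * cumsum(alpha_{i-1} / cumprod(1 - p_i))."""
    cp = exclusive_cumprod(1.0 - p_i)
    return p_i * cp * np.cumsum(alpha_prev / np.clip(cp, eps, None))

# Toy check: starting from alpha_0 = [1, 0, ..., 0], the alignments stay a valid (sub-)distribution.
T = 6
p = np.full(T, 0.4)
alpha = np.eye(1, T)[0]          # attention for the (fictitious) 0-th output step
for _ in range(3):
    alpha = expected_alignment_step(p, alpha)
    print(alpha.round(3), alpha.sum().round(3))
```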
Problem definition: MA
• Problem definitions
• Although this MA achieves online linear time decoding, the decoder can only attend to one encoder state.
• This limitation can diminish translation quality as there may be insufficient information for reordering.
• Moreover, the backpropagation is not available in the hard monotonic attention.
Background: Monotonic Chunkwise Attention (MoChA)
• Hard monotonic alignment is too restrictive!
• Using only one encoder state ℎ_{𝑡_𝑖} as the context vector 𝑐_𝑖 is too strong a constraint.
• A novel solution to this problem is the Monotonic Chunkwise Attention mechanism (MoChA) [10].
• It allows the model to use soft attention over a fixed-size chunk (of size 𝑤) of memory ending at input time step
𝑡_𝑖 for each output time step 𝑖.
$$u_{i,k} = \mathrm{ChunkEnergy}(s_{i-1}, h_k) = v^{T} \tanh(W_s s_{i-1} + W_h h_k + b)$$
$$c_i = \sum_{k=t_i-w+1}^{t_i} \frac{\exp(u_{i,k})}{\sum_{l=t_i-w+1}^{t_i} \exp(u_{i,l})}\, h_k$$
[10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018.
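A minimal NumPy sketch of the chunkwise context computation above, assuming the stopping position t_i and the chunk energies u_{i,k} are already available; the chunk size w and all names are illustrative.

```python
import numpy as np

def mocha_context(h, u_i, t_i, w):
    """Soft attention over the length-w chunk of memory ending at position t_i."""
    lo = max(0, t_i - w + 1)                      # chunk start t_i - w + 1, clipped at the sequence start
    chunk_u = u_i[lo:t_i + 1]
    weights = np.exp(chunk_u - chunk_u.max())
    weights /= weights.sum()                      # softmax over the chunk only
    return weights @ h[lo:t_i + 1]                # c_i: weighted sum of the chunk

# Toy usage: memory of 10 states, chunk size 3, stop position t_i = 6.
rng = np.random.default_rng(0)
h = rng.standard_normal((10, 4))
u_i = rng.standard_normal(10)                     # ChunkEnergy(s_{i-1}, h_k) for each k
print(mocha_context(h, u_i, t_i=6, w=3).shape)    # -> (4,)
```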
Problem definition: Two major limitations of the hard monotonic mechanism.
• Problem definitions in conventional methods.
• Not enough context in the context vector.
• Assumption of strict monotonicity in input and output alignments.
• RNN-based networks.
Proposed architecture: Monotonic Multihead Attention (MMA)
• The Transformer [1] architecture has recently become the state-of-the-art for machine translation [14].
• An important feature of the Transformer is the use of a separate multihead attention module at each layer.
• Thus, this paper proposes a new approach, Monotonic Multihead Attention (MMA), which combines the expressive
power of multihead attention with the low latency of monotonic attention.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
[14] Barrault, Loïc, et al. "Findings of the 2019 conference on machine translation (wmt19)." Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
Related works: Transformer
• Given queries 𝑄, keys 𝐾 and values 𝑉, multihead attention Multihead(𝑄, 𝐾, 𝑉) is defined as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O}$$
$$\text{where } \mathrm{head}_h = \mathrm{Attention}(Q W_h^{Q},\, K W_h^{K},\, V W_h^{V})$$
• The attention function is the scaled dot-product attention, defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
• Multihead attention thus allows each decoder layer to have multiple heads, where each head can
compute a different attention distribution.
• No RNNs or CNNs in this network.
• Computation is parallel across positions, so it is fast.
• Better performance on large datasets.
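For reference, a compact NumPy sketch of scaled dot-product multihead attention as defined above; the head count, dimensions, and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V              # Softmax(QK^T / sqrt(d_k)) V

def multihead(Q, K, V, W_Q, W_K, W_V, W_O):
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O              # Concat(head_1..head_H) W^O

# Toy usage: 4 heads, model dim 16, head dim 4, 5 target and 7 source positions.
rng = np.random.default_rng(0)
H, d_model, d_k = 4, 16, 4
Q, K, V = rng.standard_normal((5, d_model)), rng.standard_normal((7, d_model)), rng.standard_normal((7, d_model))
W_Q, W_K, W_V = (rng.standard_normal((H, d_model, d_k)) for _ in range(3))
W_O = rng.standard_normal((H * d_k, d_model))
print(multihead(Q, K, V, W_Q, W_K, W_V, W_O).shape)          # -> (5, 16)
```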
Proposed architecture: MMA
• For a Transformer with 𝐿 decoder layers and 𝐻 attention heads per layer, the paper defines the selection process of the ℎ-th
encoder-decoder attention head in the 𝑙-th decoder layer as:
$$e_{i,j}^{l,h} = \frac{m_j W_{l,h}^{K} \left(s_{i-1} W_{l,h}^{Q}\right)^{T}}{\sqrt{d_k}}$$
$$p_{i,j}^{l,h} = \mathrm{Sigmoid}\!\left(e_{i,j}^{l,h}\right)$$
$$z_{i,j}^{l,h} \sim \mathrm{Bernoulli}\!\left(p_{i,j}^{l,h}\right)$$
where $W_{l,h}^{K}$ and $W_{l,h}^{Q}$ are the input projection matrices and $d_k$ is the dimension of each attention head.
Proposed architecture: MMA
• Independent stepwise selection probability for layer 𝑙 and head ℎ:
$$p_{i,j}^{l,h} \begin{cases} \geq 0.5 & \text{layer } l \text{, head } h \text{ stops reading} \\ < 0.5 & \text{layer } l \text{, head } h \text{ moves one step forward} \end{cases}$$
• Inference algorithm (see the sketch below):
• A source token is read if the fastest head decides to read.
• A target token is written only once all the heads have finished reading.
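A minimal sketch of this read/write policy: each head keeps its own pointer, a source token is read while some head still wants to move forward, and a target token is written once every head has stopped. The 0.5 threshold follows the slides; everything else (the names and the `p_fn` interface) is an illustrative assumption.

```python
import numpy as np

def mma_step(p_fn, pointers, num_read, src_finished):
    """Advance all monotonic heads for one target step.

    p_fn(head_id, j) -> stepwise selection probability p_{i,j} for that head.
    pointers[head_id] -> current source position of the head.
    Returns the action: 'read' if some head needs more source, else 'write'.
    """
    for head, j in enumerate(pointers):
        while j < num_read:
            if p_fn(head, j) >= 0.5:      # this head stops reading at position j
                break
            j += 1                        # this head moves one step forward
        pointers[head] = j
        if j == num_read and not src_finished:
            return "read"                 # the fastest head ran out of available source tokens
    return "write"                        # all heads have stopped -> emit a target token

# Toy usage with a fixed probability table (3 heads, 8 source tokens already read).
rng = np.random.default_rng(0)
table = rng.uniform(size=(3, 8))
action = mma_step(lambda h, j: table[h, j], pointers=[0, 0, 0], num_read=8, src_finished=True)
print(action)
```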
Proposed architecture: MMA
• MMA-H
• Hard alignment.
• Potential for streaming.
• MMA-IL
• Infinite lookback.
• Good translation quality.
MMA-H
• For MMA-H(ard), the paper uses Equation (1) above to calculate the expected alignment for each layer and each head, given p_{i,j}^{l,h}.
• Each attention head in MMA-H attends to exactly one encoder state.
• However, there are multiple heads in each layer (and multiple layers).
• Therefore, compared with previous MA-based models, the MMA model can place its attention at several different positions.
• Even the hard-alignment variant (MMA-H) is thus able to preserve history information by letting some heads stay on past states.
MMA-IL
• For MMA-IL, authors calculate the Softmax energy for each head as follows:
$$u_{i,j}^{l,h} = \mathrm{SoftEnergy}(i, j) = \frac{m_j W_{l,h}^{K} \left(s_{i-1} W_{l,h}^{Q}\right)^{T}}{\sqrt{d_k}}$$
• MMA-IL then allows the decoder to access encoder states from the beginning of the source sequence.
• Each attention head in MMA-IL can attend to all previous encoder states,
• so it takes more time than MMA-H,
• but its translation quality is better than MMA-H.
• MMA models use unidirectional encoders: the encoder self-attention can only attend to previous states, which is
also required for simultaneous translation.
MMA Algorithm
Compare MMA-H to MMA-IL
• MMA-H
• This model is faster than MMA-IL → suited to streaming.
• MMA-IL
• Better translation quality.
• Thus, MMA-IL allows the model to leverage more information for translation, but MMA-H may be better suited for
streaming systems with stricter efficiency requirements!
Experiments
• Datasets
• IWSLT 2015 English-Vietnamese.
• WMT 2015 German-English.
• Latency Metrics
• Average Proportion (AP) [4].
• Average Lagging (AL) [5].
• Differentiable Average Lagging (DAL) [3].
• Quality Metrics
• BLEU score.
[3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
[4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016).
[5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. 2019.
Model specification
Results: Latency-quality for MILk and MMA on IWSLT15 En-Vi and WMT15 De-EN.
• The BLEU and latency scores on the test set are generated by setting a latency range and selecting the checkpoint with best
BLEU score on the validation set.
• Black dashed line indicates the unidirectional offline transformer model with greedy search.
• While MMA-IL tends to have a decrease in quality as the latency decreases, MMA-H has a small gain in quality as latency
decreases: a larger latency does not necessarily mean an increase in source information available to the model.
• In fact, the large latency is from the outlier attention heads, which skip the entire source sentence and point to the end of the
sentence.
Results
• Note that latency increases with the number of attention heads.
• With 6 layers, the best performance is reached with 16 heads.
Conclusion
Summary
• This paper proposed two variants of the monotonic multihead attention model for simultaneous machine translation.
• Introduced two new targeted loss terms for latency control.
• Achieved better latency-quality trade-offs than the previous state-of-the-art model.
Reference
• [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
• [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International
Conference on Machine Learning. 2017.
• [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation."
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
• [4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint
arXiv:1606.02012 (2016).
• [5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using
Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
2019.
• [6] Dalvi, Fahim, et al. "Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine
Translation." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
• [7] Grissom II, Alvin, et al. "Don’t until the final verb wait: Reinforcement learning for simultaneous machine
translation." Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP). 2014.
• [8] Gu, Jiatao, et al. "Learning to translate in real-time with neural machine translation." arXiv preprint
arXiv:1610.00388 (2016).
Reference
• [9] Luo, Yuping, et al. "Learning online alignments with continuous rewards policy gradient." 2017 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
• [10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on
Learning Representations. 2018.
• [11] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks."
Advances in neural information processing systems. 2014.
• [12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to
align and translate." arXiv preprint arXiv:1409.0473 (2014).
• [13] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based
neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
• [14] Barrault, Loïc, et al. "Findings of the 2019 conference on machine translation (wmt19)." Proceedings of the
Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
Appendix: Conventional attention mechanism
• Bahdanau attention (additive attention)
$$c_t = \sum_{j=1}^{T_x} a_{tj} h_j = H a_t$$
$$a_t = \mathrm{Softmax}\!\left(\left[\mathrm{Score}(s_{t-1}, h_j)\right]_{j=1}^{T_x}\right) \in \mathbb{R}^{T_x}$$
$$\mathrm{Score}(s_{t-1}, h_j) = v^{T} \tanh(W_a s_{t-1} + U_a h_j)$$
where 𝑠_𝑡 is the decoder hidden state, 𝑐_𝑡 is the context vector,
and 𝑦_{𝑡−1} is the input at the current time step.
Appendix: Conventional attention mechanism
• Luong attention (dot-product attention)
$$c_t = \sum_{j=1}^{T_x} a_{tj} h_j = H a_t$$
$$a_t = \mathrm{Softmax}\!\left(\left[\mathrm{Score}(s_t, h_j)\right]_{j=1}^{T_x}\right) \in \mathbb{R}^{T_x}$$
$$\tilde{s}_t = \tanh(W_{ss} s_t + W_{cs} c_t + b_s)$$
$$y_t = \mathrm{Softmax}(W_y \tilde{s}_t + b_y)$$
The differences from Bahdanau attention are:
- It uses 𝑠_𝑡 instead of 𝑠_{𝑡−1}.
- It combines 𝑐_𝑡 with 𝑠_𝑡 through the attentional state $\tilde{s}_t$ before producing the output.
The computation path in Luong attention is simpler because the part that produces the
output and the part that performs the recursive
operation of the RNN can be separated.
Appendix: Soft monotonic attention decoder algorithm (training phase)
Appendix: Hard monotonic attention decoder algorithm (evaluation phase)
More information is in the paper.
• Expected delay for layer 𝑙, head ℎ:
$$g_i^{l,h} = \sum_{j=1}^{|x|} j\, \alpha_{i,j}^{l,h}$$
• Weighted average latency loss, where 𝐶(·) is the differentiable average-lagging metric:
$$\bar{g}_i^{W} = \sum_{l=1}^{L} \sum_{h=1}^{H} \frac{\exp\!\left(g_i^{l,h}\right)}{\sum_{l=1}^{L} \sum_{h=1}^{H} \exp\!\left(g_i^{l,h}\right)}\, g_i^{l,h}, \qquad L_{avg} = C\!\left(\bar{g}^{W}\right)$$
• Head divergence (variance) loss, where $\bar{g}_i$ is the average delay over all layers and heads at step 𝑖:
$$L_{var} = \frac{1}{|y|\, L H} \sum_{i=1}^{|y|} \sum_{l=1}^{L} \sum_{h=1}^{H} \left(g_i^{l,h} - \bar{g}_i\right)^2$$
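A small NumPy sketch of the two latency-control terms above, computed from the expected alignments of every layer and head; `C` is left as a pluggable callable (the paper uses differentiable average lagging), and the shapes and names are illustrative assumptions.

```python
import numpy as np

def latency_losses(alpha, C):
    """alpha: expected alignments, shape (L, H, target_len, source_len)."""
    L, H, tgt, src = alpha.shape
    j = np.arange(1, src + 1)
    g = (alpha * j).sum(-1)                              # expected delay g_i^{l,h}, shape (L, H, tgt)
    w = np.exp(g) / np.exp(g).sum(axis=(0, 1), keepdims=True)
    g_w = (w * g).sum(axis=(0, 1))                       # weighted average delay per target step
    loss_avg = C(g_w)                                    # e.g. differentiable average lagging
    loss_var = ((g - g.mean(axis=(0, 1))) ** 2).mean()   # divergence of heads from the mean delay
    return loss_avg, loss_var

# Toy usage: 2 layers, 4 heads, 5 target steps, 7 source steps; C = mean delay for illustration.
rng = np.random.default_rng(0)
alpha = rng.dirichlet(np.ones(7), size=(2, 4, 5))        # each alpha_{i,:} sums to 1
print(latency_losses(alpha, C=np.mean))
```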
Appendix: Latency and Attention Span Control
Appendix: Latency Metric Calculation
• Average Proportion
$$AP = \frac{1}{|x|\,|y|} \sum_{i=1}^{|y|} g_i$$
• Average Lagging
$$AL = \frac{1}{\tau} \sum_{i=1}^{\tau} \left( g_i - \frac{i-1}{|y| / |x|} \right), \qquad \tau = \operatorname{argmin}_i \left( g_i = |x| \right)$$
• Differentiable Average Lagging
$$DAL = \frac{1}{|y|} \sum_{i=1}^{|y|} \left( g_i' - \frac{i-1}{|y| / |x|} \right), \qquad g_i' = \begin{cases} g_i & i = 1 \\ \max\!\left(g_i,\; g_{i-1}' + \frac{|x|}{|y|}\right) & i > 1 \end{cases}$$
where 𝑥 is the source sentence, 𝑦 is the target sentence, and 𝑔_𝑖 is the delay (number of source tokens read) when the 𝑖-th target token is written.
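The three metrics above can be computed directly from the per-target-token delays g. Here is a small NumPy sketch under the standard definitions; treat it as an illustration rather than the official evaluation code.

```python
import numpy as np

def average_proportion(g, src_len):
    return g.sum() / (src_len * len(g))

def average_lagging(g, src_len):
    gamma = len(g) / src_len                          # |y| / |x|
    tau = int(np.argmax(g >= src_len)) + 1            # first step where the full source has been read
    i = np.arange(1, tau + 1)
    return np.mean(g[:tau] - (i - 1) / gamma)

def differentiable_average_lagging(g, src_len):
    gamma = len(g) / src_len
    g_prime = np.empty_like(g, dtype=float)
    g_prime[0] = g[0]
    for i in range(1, len(g)):
        g_prime[i] = max(g[i], g_prime[i - 1] + 1.0 / gamma)
    i = np.arange(1, len(g) + 1)
    return np.mean(g_prime - (i - 1) / gamma)

# Toy usage: a wait-3 policy on a 6-token source producing 6 target tokens.
g = np.array([3, 4, 5, 6, 6, 6], dtype=float)
print(average_proportion(g, 6), average_lagging(g, 6), differentiable_average_lagging(g, 6))
```

For this wait-3 example, AL and DAL both come out to 3, matching the intuition that the output lags the input by three tokens.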
Thank you!

More Related Content

What's hot

指数時間アルゴリズムの最先端
指数時間アルゴリズムの最先端指数時間アルゴリズムの最先端
指数時間アルゴリズムの最先端Yoichi Iwata
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
Seiya Tokui
 
相互情報量を用いた独立性の検定
相互情報量を用いた独立性の検定相互情報量を用いた独立性の検定
相互情報量を用いた独立性の検定
Joe Suzuki
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningSequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learning
Roberto Pereira Silveira
 
Confidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみたConfidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみたtkng
 
RNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential DataRNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential Data
Yao-Chieh Hu
 
Genetic Algorithms Made Easy
Genetic Algorithms Made EasyGenetic Algorithms Made Easy
Genetic Algorithms Made Easy
Prakash Pimpale
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practice
tuxette
 
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
Euijin Jeong
 
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
Masashi Shibata
 
Masked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision LearnersMasked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision Learners
GuoqingLiu9
 
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
Deep Learning JP
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
ananth
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
Amit Kumar Rathi
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
Pier Luca Lanzi
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLP
Anuj Gupta
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network
Masahiro Suzuki
 
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
Takahiro Kubo
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
Grigory Sapunov
 
Pegasus
PegasusPegasus
Pegasus
Hangil Kim
 

What's hot (20)

指数時間アルゴリズムの最先端
指数時間アルゴリズムの最先端指数時間アルゴリズムの最先端
指数時間アルゴリズムの最先端
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
相互情報量を用いた独立性の検定
相互情報量を用いた独立性の検定相互情報量を用いた独立性の検定
相互情報量を用いた独立性の検定
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningSequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learning
 
Confidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみたConfidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみた
 
RNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential DataRNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential Data
 
Genetic Algorithms Made Easy
Genetic Algorithms Made EasyGenetic Algorithms Made Easy
Genetic Algorithms Made Easy
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practice
 
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
 
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
 
Masked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision LearnersMasked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision Learners
 
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLP
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network
 
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
 
Pegasus
PegasusPegasus
Pegasus
 

Similar to Monotonic Multihead Attention review

Parallel WaveGAN review
Parallel WaveGAN reviewParallel WaveGAN review
Parallel WaveGAN review
June-Woo Kim
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
Yuta Niki
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Universitat Politècnica de Catalunya
 
Conformer review
Conformer reviewConformer review
Conformer review
June-Woo Kim
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it!
Sudeep Das, Ph.D.
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
Jeong-Gwan Lee
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
Karthik Murugesan
 
Use CNN for Sequence Modeling
Use CNN for Sequence ModelingUse CNN for Sequence Modeling
Use CNN for Sequence Modeling
Dongang (Sean) Wang
 
Sequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural NetworksSequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural Networks
Nguyen Quang
 
Non autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech reviewNon autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech review
June-Woo Kim
 
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
Rashid Mijumbi
 
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Universitat Politècnica de Catalunya
 
IA3_presentation.pptx
IA3_presentation.pptxIA3_presentation.pptx
IA3_presentation.pptx
KtonNguyn2
 
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
AI Frontiers
 
Unifi
Unifi Unifi
Unifi
hangal
 
Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?
NAVER Engineering
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
changedaeoh
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Isabelle Augenstein
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx
TuCaoMinh2
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer Calixto
Sebastian Ruder
 

Similar to Monotonic Multihead Attention review (20)

Parallel WaveGAN review
Parallel WaveGAN reviewParallel WaveGAN review
Parallel WaveGAN review
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
 
Conformer review
Conformer reviewConformer review
Conformer review
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it!
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Use CNN for Sequence Modeling
Use CNN for Sequence ModelingUse CNN for Sequence Modeling
Use CNN for Sequence Modeling
 
Sequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural NetworksSequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural Networks
 
Non autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech reviewNon autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech review
 
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
 
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
 
IA3_presentation.pptx
IA3_presentation.pptxIA3_presentation.pptx
IA3_presentation.pptx
 
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
 
Unifi
Unifi Unifi
Unifi
 
Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer Calixto
 

Recently uploaded

Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
Madan Karki
 
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
HODECEDSIET
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
Aditya Rajan Patra
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
IJNSA Journal
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
University of Maribor
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
abbyasa1014
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
Dr Ramhari Poudyal
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
Las Vegas Warehouse
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
co23btech11018
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
rpskprasana
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
enizeyimana36
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
gerogepatton
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Sinan KOZAK
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 

Recently uploaded (20)

Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
 
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 

Monotonic Multihead Attention review

  • 1. Monotonic Multihead Attention Presented by: June-Woo Kim ABR lab, Department of Artificial Intelligence, Kyungpook National University 2, Dec. 2020. Xutai Ma et al. Facebook, Johns Hopkins Univ. ICLR 2020
  • 2. Presentation contents • Abstract • Background • Problem definition • Proposed architecture • Experiments and results • Reference • Appendix
  • 3. Abstract • This paper extends previous models for monotonic attention to the multi-head attention used in Transformers [1], yielding “Monotonic Multi-head Attention”. • The proposed method is a relatively straightforward extension of the previous Hard [2] and Infinite Lookback [3] monotonic attention models. • This paper achieves better latency-quality tradeoffs in simultaneous Machine Translation tasks in two language pairs. • Also, this paper is a meaningful contribution to the task of simultaneously Machine Translation. [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017. [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
  • 4. Background: Simultaneous Translation • Start translating before reading the whole source sentence. • Applications • Video Subtitle Translation. • International Conferences. • Personal Translation Assistant. • Goal: We do not want to wait for the completion of the full sentence before we start translation!
  • 5. Background: Simultaneous Translation • Same grammar structure case  Simple! • English: • A republican strategy to counter the re-election of Obama. (SVO) • Spanish: • Una estrategia republicana para contrarrestar la reelección de Obama. (SVO) • S: subject, O: object, V: verb Action English Spanish Read A Republican strategy Write Una estrategia republicana Read to counter Write para contrarrestar Read the re-election of Obama Write la reelección de Obama
  • 6. Background: Simultaneous Translation • Different grammatical structure case  More complex • English: • A republican strategy to counter the re-election of Obama. (SVO) • Korean: • Obama 재선을 대응하기 위한 공화당 전략. (SOV) • S: subject, O: object, V: verb Action English German Read A Republican strategy Write 공화당 전략 Read to counter Read the re-election of Obama Write 대응하다 Write Obama 재선
  • 7. Background on the problem: Current Approaches 1. Fixed Policy – Weaker performance • Wait If-* [4] policy. (Rule-based) • Wait-k [5] policy. (Use 𝑘 tokens) • Incremental decoding [6]. (Incremental learning) 2. Reinforcement Learning – Less stable learning process. • Markov chain [7] (Markov chain with RL) • Make decisions on when to translate from the interaction [8]. (Used pre-trained offline model to teach agent) • Continuous rewards policy [9]. (Rewards policy gradient for online alignments) [4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016). [5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. [6] Dalvi, Fahim, et al. "Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018. [7] Grissom II, Alvin, et al. "Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation." Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP). 2014. [8] Gu, Jiatao, et al. "Learning to translate in real-time with neural machine translation." arXiv preprint arXiv:1610.00388 (2016). [9] Luo, Yuping, et al. "Learning online alignments with continuous rewards policy gradient." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
  • 8. Background on the problem: Current Approaches 3. Monotonic Attention (MA) – State of the art. • Hard Monotonic Attention [2]. (First introduction of concept of MA) • Monotonic Chunkwise Attention (MoChA) [10]. (Let the model attend to a chunk of encoder states) • Monotonic Infinite Lookback Attention (MILk) [3]. (Introduced infinite lookback to improve the quality) • Let’s focus and glance this mechanism! [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017. [10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018. [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
  • 9. Background: Sequence-to-sequence • Recently, the “Sequence-to-sequence” (Seq2Seq) [11] framework has facilitated the use of RNNs on sequence transduction problems such as machine translation and speech recognition. • Encoder: input sequence is processed with some networks. (e.g., RNNs, CNNs, FC-layers, hybrid, etc.) • Decoder: produce the target sequence with output of the encoder. (almost RNNs) • This often results in the model having difficulty to longer sequences than those seen during training. [11] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014. Figure reference: https://brunch.co.kr/@kakao-it/155
  • 10. Background: Attention mechanisms in Seq2seq • Attention mechanism in Seq2seq • Encoder: produces a sequence of hidden states which correspond to entries in the input sequence. • Decoder: allow to refer back to any of the encoder states as it produces its output. • In Seq2seq with attention, the encoder produces a sequence of hidden states which correspond to entries in the input sequence! • There are some effective attention mechanism with Seq2seq • Bahdanau attention [12]. (additive attention) • Luong attention [13]. (dot-product attention) • Multi-head attention [1]. (scaled dot-product attention with multi-head. Each heads do scaled dot-production, and those separated from number of heads) [12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014). [13] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015). [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
  • 11. Background: Soft attention • Encoder RNN do the given input sequence 𝑥 = {𝑥1, … , 𝑥 𝑇} to produce a sequence of hidden states ℎ = {ℎ1, … , ℎ 𝑇}. • Referring to ℎ as the “memory” to emphasize its connection to memory-augmented neural networks • Decoder RNN produces an output sequence 𝑦 = 𝑦1, … , 𝑦 𝑈 , conditioned on the memory.
  • 12. Background: Soft attention (additive attention) 𝑒𝑖,𝑗 = 𝑎(𝑠𝑖−1, ℎ𝑗) 𝛼𝑖,𝑗 = exp 𝑒𝑖,𝑗 𝑘=1 𝑇 exp 𝑒𝑖,𝑘 𝑐𝑖 = 𝑗=1 𝑇 𝛼𝑖,𝑗ℎ𝑗 𝑠𝑖 = 𝑓(𝑠𝑖−1, 𝑦𝑖−1, 𝑐𝑖) 𝑦𝑖 = 𝑔(𝑠𝑖, 𝑐𝑖) • When computing 𝑦𝑖, a soft attention-based decoder uses a learnable nonlinear function 𝑎(∙) to produce a scalar value 𝑒𝑖,𝑗 for each entry ℎ𝑗 in the memory based on ℎ𝑗 and the decoder’s state at the previous timestep 𝑠𝑖−1. • 𝑎(∙) is a single-layer neural network using a 𝑡𝑎𝑛ℎ nonlinearity, but other functions such as a simple dot product between 𝑠𝑖−1 and ℎ𝑗 have been used. • 𝑐𝑖 is the weighted sum of ℎ. • Decoder updates its state to 𝑠𝑖 based on 𝑠𝑖−1 and 𝑐𝑖 and produces 𝑦𝑖. • 𝑓(∙) is a RNN (one or more LSTM or GRU) and 𝑔(∙) is a learnable nonlinear function which maps the decoder state to the output space.
  • 13. Problem definitions in Attention mechanism in Seq2seq • Problem • A common criticism of soft attention is that the model must perform a pass over the entire input sequence when producing each element of the output sequence. • This results in the decoding process having complexity 𝑂(𝑇𝑈), where 𝑇 and 𝑈 are the input and output sequence lengths respectively. • Furthermore, because the entire sequence must be processed prior to outputting any symbols, soft attention cannot be used in “online” settings where output sequence elements are produced when the input has only been partially observed.
  • 14. Background: Monotonic Attention (MA) • Soft attention mechanisms perform a pass over the entire input sequence when producing each element in the output sequence. • Authors [2] proposed an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time. Conventional soft attention (e.g., additive, dot product) Monotonic attention [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
  • 15. Background: MA • If for any output timestep 𝑖 it have 𝑧𝑖,𝑗 = 0 for 𝑗 ∈ {𝑡𝑖 − 1, . . . , 𝑇}, it can simply set 𝑐𝑖 to a vector of zeros. • 𝑒𝑖,𝑗 = 𝑎(𝑠𝑖−1, ℎ𝑗)  𝑒𝑖,𝑗 = 𝑀𝑜𝑛𝑜𝑡𝑜𝑛𝑖𝑐𝐸𝑛𝑒𝑟𝑔𝑦(𝑠𝑖−1, ℎ𝑗) • 𝑝𝑖,𝑗 = 𝜎(𝑒𝑖,𝑗) • 𝑧𝑖,𝑗~Bernoulli(𝑝𝑖,𝑗) • where 𝑎(·) is a learnable deterministic “energy function” and 𝜎(·) is the logistic sigmoid function. • Note that the above model only needs ℎ 𝑘, 𝑘 ∈ {1, … , 𝑗} compute ℎ𝑗. • Time complexity will be 𝑂(max 𝑇, 𝑈 ) • Through this mechanism, we know that MA explicitly processes the input sequence in a left-to-right order and makes a hard assignment of 𝑐_𝑖 to one particular encoder state denoted ℎ_(𝑡_𝑖 ).
  • 16. Background: MA • This model cannot train using back-propagation because of sampling. • So that, this model use the expectation of ℎ𝑗 during training (inspired from soft alignments) and try to induce discreteness into 𝑝𝑖,𝑗. • The 𝛼𝑖,𝑗 defines the probability that input time step 𝑗 is attended at output time step 𝑖. • 𝛼𝑖,𝑗 = 𝑃𝑖 ℎ𝑗 𝑢𝑠𝑒𝑑 = 𝑃𝑖 ℎ𝑗 𝑢𝑠𝑒𝑑 ℎ𝑗 𝑐ℎ𝑒𝑐𝑘𝑒𝑑)𝑃𝑖(ℎ𝑗 𝑐ℎ𝑒𝑐𝑘𝑒𝑑) • 𝑃𝑖 ℎ𝑗 𝑢𝑠𝑒𝑑 ℎ𝑗 𝑐ℎ𝑒𝑐𝑘𝑒𝑑 = 𝑝𝑖,𝑗 • 𝑃𝑖 ℎ𝑗 𝑐ℎ𝑒𝑐𝑘𝑒𝑑 = 𝑃𝑖 ℎ𝑗−1 𝑛𝑜𝑡 𝑢𝑠𝑒𝑑, ℎ𝑗−1 𝑐ℎ𝑒𝑐𝑘𝑒𝑑 + 𝑃𝑖−1 ℎ𝑗 𝑢𝑠𝑒𝑑 𝑗 𝑐ℎ𝑒𝑐𝑘𝑒𝑑)
  • 17. Background: MA • Stepwise decision at output step i, source step j: if p_{i,j} ≥ 0.5, stop reading and write the i-th target token; if p_{i,j} < 0.5, read the (j + 1)-th source token. • Expected alignment (monotonic attention): α_{i,:} = p_{i,:} ⊙ cumprod(1 − p_{i,:}) ⊙ cumsum( α_{i−1,:} / cumprod(1 − p_{i,:}) )   (1) • where cumprod(x) = (1, x_1, x_1 x_2, …, Π_{i=1}^{|x|−1} x_i) is exclusive and cumsum(x) = (x_1, x_1 + x_2, …, Σ_{i=1}^{|x|} x_i) is inclusive. • The resulting process computes at most max(T, U) probabilities p_{i,j}, i.e. a linear runtime. • (A NumPy sketch of Equation (1) follows below.)
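The following is a minimal NumPy sketch of Equation (1), computing the expected alignment for one output step in parallel with an exclusive cumprod and an inclusive cumsum. The clipping constant eps and the one-hot initialisation of the first alignment are common implementation choices stated here as assumptions, not taken from the paper:

import numpy as np

def exclusive_cumprod(x):
    # cumprod(x) = (1, x_1, x_1 x_2, ..., prod_{i=1}^{|x|-1} x_i)
    return np.concatenate(([1.0], np.cumprod(x)[:-1]))

def expected_alignment(p_i, alpha_prev, eps=1e-10):
    """Equation (1): p_i and alpha_prev are length-T vectors p_{i,:} and alpha_{i-1,:}."""
    cp = exclusive_cumprod(1.0 - p_i)            # probability of not stopping before j
    return p_i * cp * np.cumsum(alpha_prev / np.clip(cp, eps, None))

# Toy check with alpha_{0,:} initialised as a one-hot on the first source position
T = 5
alpha = np.zeros(T); alpha[0] = 1.0
alpha = expected_alignment(np.full(T, 0.5), alpha)
print(alpha, alpha.sum())  # sums to <= 1; the missing mass is "never attends"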
  • 18. Problem definition: MA • Problem definitions • Although MA achieves online, linear-time decoding, the decoder can only attend to one encoder state per output token. • This limitation can diminish translation quality, as there may be insufficient information for reordering. • Moreover, back-propagation is not possible through the hard attention decisions, which is why the expected alignment is used during training.
  • 19. Background: Monotonic Chunkwise Attention (MoChA) • Hard monotonic alignment is too restrictive: using only the single vector h_{t_i} as the context vector c_i is too strong a constraint. • A solution to this problem is the Monotonic Chunkwise Attention (MoChA) mechanism [10]. • It allows the model to perform soft attention over a fixed-size chunk (of size w) of memory ending at input timestep t_i, for each output timestep i: • u_{i,k} = ChunkEnergy(s_{i−1}, h_k) = v^T tanh(W_s s_{i−1} + W_h h_k + b) • c_i = Σ_{k=t_i−w+1}^{t_i} [ exp(u_{i,k}) / Σ_{l=t_i−w+1}^{t_i} exp(u_{i,l}) ] h_k • (A sketch of the chunkwise context follows below.) [10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018.
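A minimal NumPy sketch of the MoChA context for one output step, assuming the hard-attended index t_i has already been chosen; chunk_energy is a hypothetical stand-in for ChunkEnergy(s_{i−1}, h_k):

import numpy as np

def mocha_context(s_prev, H, t_i, w, chunk_energy):
    """Soft attention over the chunk {t_i - w + 1, ..., t_i} of the memory H: (T, d)."""
    lo = max(0, t_i - w + 1)                      # clip the window at the sequence start
    chunk = H[lo:t_i + 1]                         # (<= w, d)
    u = np.array([chunk_energy(s_prev, h_k) for h_k in chunk])
    a = np.exp(u - u.max())
    a /= a.sum()                                  # softmax restricted to the chunk
    return a @ chunk                              # c_i: weighted sum of the chunk entries

# Illustrative usage with a dot-product stand-in energy:
# c_i = mocha_context(s_prev, H, t_i, w=4, chunk_energy=lambda s, h: float(s @ h))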
  • 20. Problem definition: Two major limitations of the hard monotonic mechanism • Problem definitions in the conventional methods: • Not enough context in the context vector. • The assumption of strictly monotonic input-output alignments. • In addition, previous monotonic attention models are built on RNN-based networks rather than the Transformer.
  • 21. Proposed architecture: Monotonic Multihead Attention (MMA) • The Transformer [1] architecture has recently become the state of the art for machine translation [14]. • An important feature of the Transformer is the use of a separate multihead attention module at each layer. • Thus, this paper proposes a new approach, Monotonic Multihead Attention (MMA), which combines the expressive power of multihead attention with the low latency of monotonic attention. [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. [14] Barrault, Loïc, et al. "Findings of the 2019 conference on machine translation (wmt19)." Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
  • 22. Related work: Transformer • Given queries Q, keys K and values V, multihead attention MultiHead(Q, K, V) is defined as: • MultiHead(Q, K, V) = Concat(head_1, …, head_H) W^O, where head_h = Attention(Q W_h^Q, K W_h^K, V W_h^V) • The attention function is the scaled dot-product attention, defined as: Attention(Q, K, V) = Softmax(Q K^T / √d_k) V • Multihead attention therefore allows each decoder layer to have multiple heads, where each head can compute a different attention distribution. • No RNNs or CNNs in this network. • Highly parallelizable, hence fast to train. • Better performance on large datasets. • (A sketch of multihead attention follows below.)
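As a concrete illustration (the weight layout and shapes are assumptions, not the exact Transformer implementation), multihead scaled dot-product attention can be sketched in NumPy as:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Q: (Tq, d_model); K, V: (Tk, d_model); Wq/Wk/Wv: (H, d_model, d_k); Wo: (H*d_k, d_model)."""
    num_heads, _, d_k = Wq.shape
    heads = []
    for h in range(num_heads):
        q, k, v = Q @ Wq[h], K @ Wk[h], V @ Wv[h]       # per-head projections
        attn = softmax(q @ k.T / np.sqrt(d_k))          # (Tq, Tk) scaled dot-product weights
        heads.append(attn @ v)                          # (Tq, d_k) per-head output
    return np.concatenate(heads, axis=-1) @ Wo          # Concat(head_1, ..., head_H) W^O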
  • 23. Proposed architecture: MMA • For a Transformer with L decoder layers and H attention heads per layer, the paper defines the selection process of the h-th encoder-decoder attention head in the l-th decoder layer as: • e_{i,j}^{l,h} = ( m_j W_{l,h}^K ) ( s_{i−1} W_{l,h}^Q )^T / √d_k • p_{i,j}^{l,h} = Sigmoid(e_{i,j}^{l,h}) • z_{i,j}^{l,h} ~ Bernoulli(p_{i,j}^{l,h}) • where W_{l,h}^K and W_{l,h}^Q are the input projection matrices, m_j is the j-th encoder state, and d_k is the dimension of each attention head.
  • 24. Proposed architecture: MMA • Independent stepwise selection probability for layer l and head h: • if p_{i,j}^{l,h} < 0.5, head h of layer l moves one step forward (keeps reading); if p_{i,j}^{l,h} ≥ 0.5, it stops reading. • Inference algorithm (see the sketch below): • A source token is read if the fastest head decides to read. • A target token is written only once all the heads have finished reading.
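The following is a hedged, simplified sketch of this read/write policy (not the paper's exact inference algorithm): each monotonic head scans the already-read source prefix independently; if any head reaches the end of the prefix without stopping, the model reads another source token, and a target token is written only once every head has stopped. head_prob is a hypothetical stand-in for p_{i,j}^{l,h}:

def mma_policy_step(num_layers, num_heads, i, head_start, num_read, head_prob):
    """Return per-head stop positions for target step i, or None if more source must be read.
    head_start[(l, h)]: index from which head (l, h) resumes scanning."""
    stops = {}
    for l in range(num_layers):
        for h in range(num_heads):
            j = head_start[(l, h)]
            while j < num_read and head_prob(l, h, i, j) < 0.5:
                j += 1                        # this head keeps reading within the prefix
            if j == num_read:                 # the fastest head wants more source
                return None                   # -> READ one more source token, then retry
            stops[(l, h)] = j                 # this head has finished reading
    return stops                              # all heads stopped -> WRITE target token i

After a write, the caller would update head_start from the returned stop positions before moving on to target step i + 1.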
  • 25. Proposed architecture: MMA • MMA-H • Hard alignment. • Potential for streaming. • MMA-IL • Infinite lookback. • Good translation quality.
  • 26. MMA-H • For MMA-H(ard), the paper uses Equation (1) (slide 17) to calculate the expected alignment for each layer and each head, given p_{i,j}^{l,h}. • Each attention head in MMA-H attends to exactly one encoder state. • However, there are multiple heads in each layer. • Therefore, compared with the previous single-head MA-based models, MMA can attend to several different positions at the same time. • Even the hard-alignment variant (MMA-H) is thus able to preserve history information by letting some heads point to past states.
  • 27. MMA-IL • For MMA-IL, the authors calculate a softmax energy for each head as follows: • u_{i,j}^{l,h} = SoftEnergy = ( m_j W_{l,h}^K ) ( s_{i−1} W_{l,h}^Q )^T / √d_k • and then allow the decoder to access encoder states from the beginning of the source sequence. • Each attention head in MMA-IL can attend to all previous encoder states (infinite lookback). • It is therefore slower than MMA-H, • but its translation quality is better. • MMA models use unidirectional encoders: the encoder self-attention can only attend to previous states, which is also required for simultaneous translation. • (A per-head sketch follows below.)
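As a hedged per-head sketch (the weight names and the use of the raw memory as values are assumptions), the MMA-IL context for one head can be written as: the head first halts at index t_i via the hard monotonic decision, then soft-attends over all encoder states up to t_i with the SoftEnergy above:

import numpy as np

def mma_il_head_context(s_prev, M, t_i, Wk, Wq):
    """s_prev: (d_model,) decoder state; M: (T, d_model) encoder memory; t_i: halt index."""
    d_k = Wk.shape[1]
    q = s_prev @ Wq                              # (d_k,) query projection of the decoder state
    K = M[:t_i + 1] @ Wk                         # (t_i + 1, d_k) keys over the read prefix only
    u = K @ q / np.sqrt(d_k)                     # soft energies u_{i,j} for j <= t_i
    a = np.exp(u - u.max())
    a /= a.sum()                                 # softmax with lookback to the sentence start
    return a @ M[:t_i + 1]                       # per-head context vector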
  • 29. Comparing MMA-H and MMA-IL • MMA-H • Faster than MMA-IL → better suited to streaming. • MMA-IL • Higher translation quality. • Thus, MMA-IL allows the model to leverage more information for translation, while MMA-H may be better suited for streaming systems with stricter efficiency requirements!
  • 30. Experiments • Datasets • IWSLT 2015 English-Vietnamese. • WMT 2015 German-English. • Latency Metrics • Average Proportion (AP) [4]. • Average Lagging (AL) [5]. • Differentiable Average Lagging (DAL) [3]. • Quality Metric • BLEU score. [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. [4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016). [5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
  • 32. Results: Latency-quality trade-offs for MILk and MMA on IWSLT15 En-Vi and WMT15 De-En • The BLEU and latency scores on the test set are generated by setting a latency range and selecting the checkpoint with the best BLEU score on the validation set. • The black dashed line indicates the unidirectional offline transformer model with greedy search. • While MMA-IL tends to lose quality as the latency decreases, MMA-H shows a small gain in quality as latency decreases: a larger latency does not necessarily mean that more source information is available to the model. • In fact, the large latency comes from outlier attention heads that skip the entire source sentence and point to its end.
  • 33. Results • Note that latency increases with the number of attention heads. • With 6 layers, the best performance is reached with 16 heads.
  • 34. Conclusion Summary • This paper proposed two variants of the monotonic multihead attention model for simultaneous machine translation. • Introduced two new targeted loss terms for latency control. • Achieved better latency-quality trade-offs than the previous state-of-the-art model.
  • 35. Reference • [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. • [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017. • [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. • [4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016). • [5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. • [6] Dalvi, Fahim, et al. "Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018. • [7] Grissom II, Alvin, et al. "Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation." Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP). 2014. • [8] Gu, Jiatao, et al. "Learning to translate in real-time with neural machine translation." arXiv preprint arXiv:1610.00388 (2016).
  • 36. Reference • [9] Luo, Yuping, et al. "Learning online alignments with continuous rewards policy gradient." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017. • [10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018. • [11] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014. • [12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014). • [13] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015). • [14] Barrault, Loïc, et al. "Findings of the 2019 conference on machine translation (wmt19)." Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
  • 37. Appendix: Conventional attention mechanisms • Bahdanau attention (additive attention) • c_t = Σ_{j=1}^{T_x} a_{tj} h_j = H a_t • a_t = Softmax( [Score(s_{t−1}, h_j)]_{j=1}^{T_x} ) ∈ ℝ^{T_x} • Score(s_{t−1}, h_j) = v^T tanh(W_a s_{t−1} + U_a h_j) • where s_t is the decoder hidden state, c_t is the context vector, and y_{t−1} is the input at the current time step.
  • 38. Appendix: Conventional attention mechanisms • Luong attention (dot-product attention) • c_t = Σ_{j=1}^{T_x} a_{tj} h_j = H a_t • a_t = Softmax( [Score(s_t, h_j)]_{j=1}^{T_x} ) ∈ ℝ^{T_x} • s̃_t = tanh(W_ss s_t + W_cs c_t + b_s) • y_t = Softmax(W_y s̃_t + b_y) • Differences from Bahdanau attention: • it uses the current state s_t instead of s_{t−1}; • the context c_t is combined with s_t after the recurrence (to form s̃_t) rather than being fed into the RNN. • The computation path in Luong attention is simpler because the output computation and the recurrent RNN computation can be separated.
  • 39. Appendix: Soft monotonic attention decoder algorithm (training phase)
  • 40. Appendix: Hard monotonic attention decoder algorithm (evaluation phase)
  • 41. Appendix: Latency and Attention Span Control • More details are given in the paper. • Expected delay for layer l, head h: g_i^{l,h} = Σ_{j=1}^{|x|} j α_{i,j}^{l,h} • Weighted average delay: g_i^W = Σ_{l=1}^{L} Σ_{h=1}^{H} [ exp(g_i^{l,h}) / Σ_{l=1}^{L} Σ_{h=1}^{H} exp(g_i^{l,h}) ] g_i^{l,h}, giving the weighted average latency loss L_avg = C(g^W), where C(·) is the differentiable average lagging. • Head divergence loss: L_var = (1 / (|y| L H)) Σ_{i=1}^{|y|} Σ_{l=1}^{L} Σ_{h=1}^{H} ( g_i^{l,h} − ḡ_i )^2, where ḡ_i is the average delay over all heads at target step i. • (A NumPy sketch of these terms follows below.)
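A minimal NumPy sketch of these latency-control terms, computed from the expected alignments. The array layout is an assumption, and the latency cost C(·) applied to the weighted average delay is left abstract:

import numpy as np

def latency_control_terms(alpha):
    """alpha: (L, H, len_y, len_x) expected alignments alpha_{i,j}^{l,h}."""
    L, H, len_y, len_x = alpha.shape
    positions = np.arange(1, len_x + 1)
    g = (alpha * positions).sum(-1)                        # (L, H, len_y) expected delays g_i^{l,h}
    e = np.exp(g - g.max(axis=(0, 1), keepdims=True))      # softmax over heads, per target step
    w = e / e.sum(axis=(0, 1), keepdims=True)
    g_weighted = (w * g).sum(axis=(0, 1))                  # (len_y,) weighted average delay g_i^W
    g_bar = g.mean(axis=(0, 1), keepdims=True)             # average delay over heads, per step
    L_var = ((g - g_bar) ** 2).mean()                      # head divergence loss
    return g_weighted, L_var                               # L_avg would be C(g_weighted)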
  • 42. Appendix: Latency Metric Calculation • Average Proportion: AP = (1 / (|x| |y|)) Σ_{i=1}^{|y|} g_i • Average Lagging: AL = (1/τ) Σ_{i=1}^{τ} ( g_i − (i − 1) |x| / |y| ), where τ is the first index i such that g_i = |x| • Differentiable Average Lagging: DAL = (1/|y|) Σ_{i=1}^{|y|} ( g'_i − (i − 1) |x| / |y| ), where g'_i = g_i for i = 1 and g'_i = max( g_i, g'_{i−1} + |x| / |y| ) for i > 1 • where x is the source sentence, y the target sentence, and g_i the delay (the number of source tokens read when the i-th target token is written). • (A sketch of these metrics follows below.)
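The three metrics can be computed directly from the delay vector g; the sketch below assumes the full source is eventually read (so that g reaches |x|) and is an illustration rather than the official evaluation code:

import numpy as np

def average_proportion(g, src_len):
    return float(g.sum() / (src_len * len(g)))

def average_lagging(g, src_len):
    d = src_len / len(g)                              # ideal (wait-0) delay per target token
    tau = int(np.argmax(g >= src_len)) + 1            # first target step with the full source read
    i = np.arange(tau)
    return float((g[:tau] - i * d).mean())

def differentiable_average_lagging(g, src_len):
    d = src_len / len(g)
    g_prime = np.empty(len(g))
    g_prime[0] = g[0]
    for t in range(1, len(g)):
        g_prime[t] = max(g[t], g_prime[t - 1] + d)    # enforce a minimum per-token delay
    i = np.arange(len(g))
    return float((g_prime - i * d).mean())

# Toy usage: a wait-3 policy with an 8-token source and 8-token target
g = np.minimum(np.arange(1, 9) + 2, 8)
print(average_proportion(g, 8), average_lagging(g, 8), differentiable_average_lagging(g, 8))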

Editor's Notes

  1. Simultaneous Translation is very useful in many applications.
  2. Offline model: best performance with 3 layers and 2 heads (6 heads in total). MMA-H: improves with 1 layer and more heads. MMA-IL: behaves similarly to the offline model; best with 6 layers and 4 heads per layer (24 heads in total). For latency, the best performance is MMA-IL with 6 layers and 16 heads (96 heads in total).