Monotonic Multihead Attention
Presented by: June-Woo Kim
ABR lab, Department of Artificial Intelligence,
Kyungpook National University
2, Dec. 2020.
Xutai Ma et al.
Facebook, Johns Hopkins Univ.
ICLR 2020
Presentation contents
• Abstract
• Background
• Problem definition
• Proposed architecture
• Experiments and results
• Reference
• Appendix
Abstract
• This paper extends previous models for monotonic attention to the multi-head attention used in Transformers [1],
yielding “Monotonic Multi-head Attention”.
• The proposed method is a relatively straightforward extension of the previous Hard [2] and Infinite Lookback [3]
monotonic attention models.
• This paper achieves better latency-quality tradeoffs in simultaneous Machine Translation tasks in two language pairs.
• Also, this paper is a meaningful contribution to the task of simultaneous machine translation.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
[2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
[3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Background: Simultaneous Translation
• Start translating before reading the whole source sentence.
• Applications
• Video Subtitle Translation.
• International Conferences.
• Personal Translation Assistant.
• Goal: We do not want to wait for the completion of the full sentence before we start translation!
Background: Simultaneous Translation
• Same grammatical structure → simple!
• English:
• A republican strategy to counter the re-election of Obama. (SVO)
• Spanish:
• Una estrategia republicana para contrarrestar la reelección de Obama. (SVO)
• S: subject, O: object, V: verb
Action | English                   | Spanish
Read   | A Republican strategy     |
Write  |                           | Una estrategia republicana
Read   | to counter                |
Write  |                           | para contrarrestar
Read   | the re-election of Obama  |
Write  |                           | la reelección de Obama
Background: Simultaneous Translation
• Different grammatical structure → more complex
• English:
• A republican strategy to counter the re-election of Obama. (SVO)
• Korean:
• Obama 재선을 대응하기 위한 공화당 전략. (SOV)
• S: subject, O: object, V: verb
Action | English                   | Korean
Read   | A Republican strategy     |
Write  |                           | 공화당 전략
Read   | to counter                |
Read   | the re-election of Obama  |
Write  |                           | 대응하다
Write  |                           | Obama 재선
Background on the problem: Current Approaches
1. Fixed Policy – Weaker performance
• Wait If-* [4] policy. (Rule-based)
• Wait-k [5] policy. (Start translating after reading the first 𝑘 source tokens)
• Incremental decoding [6]. (Incremental learning)
2. Reinforcement Learning – Less stable learning process.
• Markov chain [7] (Markov chain with RL)
• Make decisions on when to translate from the interaction [8]. (Used pre-trained offline model to teach agent)
• Continuous rewards policy [9]. (Rewards policy gradient for online alignments)
[4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016).
[5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. 2019.
[6] Dalvi, Fahim, et al. "Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation." Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
[7] Grissom II, Alvin, et al. "Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation." Proceedings of the 2014 Conference on empirical methods in natural language processing
(EMNLP). 2014.
[8] Gu, Jiatao, et al. "Learning to translate in real-time with neural machine translation." arXiv preprint arXiv:1610.00388 (2016).
[9] Luo, Yuping, et al. "Learning online alignments with continuous rewards policy gradient." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
Background on the problem: Current Approaches
3. Monotonic Attention (MA) – State of the art.
• Hard Monotonic Attention [2]. (First introduction of concept of MA)
• Monotonic Chunkwise Attention (MoChA) [10]. (Let the model attend to a chunk of encoder states)
• Monotonic Infinite Lookback Attention (MILk) [3]. (Introduced infinite lookback to improve the quality)
• Let’s take a closer look at this mechanism!
[2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
[10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018.
[3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Background: Sequence-to-sequence
• Recently, the “Sequence-to-sequence” (Seq2Seq) [11] framework has facilitated the use of RNNs on sequence
transduction problems such as machine translation and speech recognition.
• Encoder: the input sequence is processed by some network. (e.g., RNNs, CNNs, FC layers, hybrids, etc.)
• Decoder: produces the target sequence from the output of the encoder. (usually an RNN)
• This often results in the model having difficulty generalizing to longer sequences than those seen during training.
[11] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.
Figure reference: https://brunch.co.kr/@kakao-it/155
Background: Attention mechanisms in Seq2seq
• Attention mechanism in Seq2seq
• Encoder: produces a sequence of hidden states, each corresponding to an entry in the input sequence.
• Decoder: is allowed to refer back to any of the encoder states as it produces its output.
• Several effective attention mechanisms have been used with Seq2seq:
• Bahdanau attention [12]. (additive attention)
• Luong attention [13]. (dot-product attention)
• Multi-head attention [1]. (scaled dot-product attention with multiple heads; each head performs scaled dot-product attention on its own projection of the queries, keys, and values)
[12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
[13] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Background: Soft attention
• The encoder RNN processes the given input sequence 𝑥 = {𝑥_1, … , 𝑥_𝑇} to produce a sequence of hidden states ℎ = {ℎ_1, … , ℎ_𝑇}.
• ℎ is referred to as the “memory” to emphasize its connection to memory-augmented neural networks.
• The decoder RNN produces an output sequence 𝑦 = {𝑦_1, … , 𝑦_𝑈}, conditioned on the memory.
Background: Soft attention (additive attention)
$$e_{i,j} = a(s_{i-1}, h_j)$$
$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T} \exp(e_{i,k})}$$
$$c_i = \sum_{j=1}^{T} \alpha_{i,j} h_j$$
$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
$$y_i = g(s_i, c_i)$$
• When computing 𝑦𝑖, a soft attention-based decoder uses a learnable nonlinear function 𝑎(∙) to produce a scalar value 𝑒𝑖,𝑗
for each entry ℎ𝑗 in the memory based on ℎ𝑗 and the decoder’s state at the previous timestep 𝑠𝑖−1.
• 𝑎(∙) is a single-layer neural network using a 𝑡𝑎𝑛ℎ nonlinearity, but other functions such as a simple dot product between 𝑠𝑖−1 and ℎ𝑗
have been used.
• 𝑐𝑖 is the weighted sum of ℎ.
• Decoder updates its state to 𝑠𝑖 based on 𝑠𝑖−1 and 𝑐𝑖 and produces 𝑦𝑖.
• 𝑓(∙) is an RNN (one or more LSTM or GRU layers) and 𝑔(∙) is a learnable nonlinear function which maps the decoder state to the
output space.
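As a concrete illustration, here is a minimal NumPy sketch of one decoder step of the additive soft attention described above; the dimensions, initialization, and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def additive_energies(s_prev, h, W_s, W_h, v):
    # e_{i,j} = v^T tanh(W_s s_{i-1} + W_h h_j) for every memory entry h_j
    return np.tanh(s_prev @ W_s + h @ W_h) @ v            # shape: (T,)

def soft_attention_step(s_prev, h, W_s, W_h, v):
    e = additive_energies(s_prev, h, W_s, W_h, v)          # one scalar energy per h_j
    alpha = softmax(e)                                     # weights over ALL T memory entries
    c = alpha @ h                                          # context c_i: weighted sum of the memory
    return c, alpha

# Toy usage: T = 5 encoder states of size 8, decoder state of size 8, attention dim 16.
rng = np.random.default_rng(0)
T, d, d_att = 5, 8, 16
h = rng.standard_normal((T, d))
s_prev = rng.standard_normal(d)
W_s, W_h, v = rng.standard_normal((d, d_att)), rng.standard_normal((d, d_att)), rng.standard_normal(d_att)
c, alpha = soft_attention_step(s_prev, h, W_s, W_h, v)
print(alpha.round(3), c.shape)   # weights sum to 1; c has shape (8,)
```

Note that every output step touches all T memory entries, which is exactly the O(TU) cost criticized on the next slide.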
Problem definitions in Attention mechanism in Seq2seq
• Problem
• A common criticism of soft attention is that the model must perform a pass over the entire input sequence when producing
each element of the output sequence.
• This results in the decoding process having complexity 𝑂(𝑇𝑈), where 𝑇 and 𝑈 are the input and output sequence lengths
respectively.
• Furthermore, because the entire sequence must be processed prior to outputting any symbols, soft attention cannot be used
in “online” settings where output sequence elements are produced when the input has only been partially observed.
Background: Monotonic Attention (MA)
• Soft attention mechanisms perform a pass over the entire input sequence when producing each element in the
output sequence.
• Authors [2] proposed an end-to-end differentiable method for learning monotonic alignments which, at test time,
enables computing attention online and in linear time.
[Figure: conventional soft attention (e.g., additive, dot-product) vs. monotonic attention]
[2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
Background: MA
• If for any output timestep 𝑖 we have 𝑧_{𝑖,𝑗} = 0 for all 𝑗 ∈ {𝑡_{𝑖−1}, … , 𝑇} (the mechanism never chooses to stop), 𝑐_𝑖 is simply set to a vector of zeros.
• 𝑒_{𝑖,𝑗} = 𝑎(𝑠_{𝑖−1}, ℎ_𝑗) → 𝑒_{𝑖,𝑗} = MonotonicEnergy(𝑠_{𝑖−1}, ℎ_𝑗)
• 𝑝_{𝑖,𝑗} = 𝜎(𝑒_{𝑖,𝑗})
• 𝑧_{𝑖,𝑗} ~ Bernoulli(𝑝_{𝑖,𝑗})
• where 𝑎(·) is a learnable deterministic “energy function” and 𝜎(·) is the logistic sigmoid function.
• Note that the decision at position 𝑗 only needs the encoder states ℎ_𝑘, 𝑘 ∈ {1, … , 𝑗}, so attention can be computed online as the input arrives.
• The time complexity is 𝑂(max(𝑇, 𝑈)).
• Through this mechanism, MA explicitly processes the input sequence in a left-to-right order and makes a hard assignment of
𝑐_𝑖 to one particular encoder state, denoted ℎ_{𝑡_𝑖}.
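A minimal sketch of this test-time hard monotonic decoding process, in plain Python/NumPy; `monotonic_energy` and `decoder_step` stand in for the learned energy function and decoder cell and are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_monotonic_decode(h, s0, monotonic_energy, decoder_step, max_len=50):
    """Test-time hard monotonic attention: scan the memory left to right and
    stop at the first position whose selection probability exceeds 0.5."""
    T = len(h)
    s, t = s0, 0                       # t: current memory position (never moves backwards)
    outputs = []
    for i in range(max_len):
        c = np.zeros_like(h[0])        # c_i = 0 if the mechanism never chooses to stop
        j = t
        while j < T:
            p = sigmoid(monotonic_energy(s, h[j]))   # p_{i,j}
            if p >= 0.5:               # z_{i,j} = 1: attend to h_j and write
                c, t = h[j], j
                break
            j += 1                     # z_{i,j} = 0: move on to the next source state
        s, y = decoder_step(s, c)      # produce output token y_i
        outputs.append(y)
    return outputs
```

Because the position t never moves backwards, the number of energy evaluations grows only linearly with the sequence lengths.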
Background: MA
• This model cannot be trained with back-propagation because of the sampling step.
• Instead, the model is trained on the expected value of the context vector (inspired by soft alignments), while encouraging 𝑝_{𝑖,𝑗} to become discrete.
• 𝛼_{𝑖,𝑗} defines the probability that input time step 𝑗 is attended at output time step 𝑖:
$$\alpha_{i,j} = P_i(h_j \text{ used}) = P_i(h_j \text{ used} \mid h_j \text{ checked})\, P_i(h_j \text{ checked})$$
$$P_i(h_j \text{ used} \mid h_j \text{ checked}) = p_{i,j}$$
$$P_i(h_j \text{ checked}) = P_i(h_{j-1} \text{ checked},\, h_{j-1} \text{ not used}) + P_{i-1}(h_j \text{ used})$$
Background: MA
• Stepwise probability at output step 𝑖, source position 𝑗, and the corresponding test-time action:
$$p_{i,j} \begin{cases} \geq 0.5 & \text{write the } i\text{-th target token (attend to } h_j\text{)} \\ < 0.5 & \text{read the } (j+1)\text{-th source token} \end{cases}$$
• Expected alignment (monotonic attention), computed in parallel during training:
$$\alpha_{i,:} = p_{i,:} \cdot \mathrm{cumprod}(1 - p_{i,:}) \cdot \mathrm{cumsum}\!\left(\frac{\alpha_{i-1,:}}{\mathrm{cumprod}(1 - p_{i,:})}\right) \quad (1)$$
where $\mathrm{cumprod}(x) = \left[1,\, x_1,\, x_1 x_2,\, \dots,\, \prod_{i=1}^{|x|-1} x_i\right]$ and $\mathrm{cumsum}(x) = \left[x_1,\, x_1 + x_2,\, \dots,\, \sum_{i=1}^{|x|} x_i\right]$.
The resulting process computes at most max(𝑇, 𝑈) terms 𝑝_{𝑖,𝑗}, i.e., it runs in linear time.
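Below is a small NumPy sketch of Equation (1), computing the expected monotonic alignment α_{i,:} from p_{i,:} and α_{i−1,:}; the epsilon guard against division by zero is an implementation assumption.

```python
import numpy as np

def exclusive_cumprod(x):
    # [1, x_1, x_1*x_2, ..., prod(x_1..x_{T-1})]
    return np.concatenate(([1.0], np.cumprod(x)[:-1]))

def expected_alignment_step(p_i, alpha_prev, eps=1e-10):
    """Equation (1): alpha_i = p_i * cumprod(1 - p_i) * cumsum(alpha_{i-1} / cumprod(1 - p_i))."""
    cp = exclusive_cumprod(1.0 - p_i)
    return p_i * cp * np.cumsum(alpha_prev / np.clip(cp, eps, None))

# Toy check: starting from alpha_0 = [1, 0, ..., 0], the alignments stay a valid (sub-)distribution.
T = 6
p = np.full(T, 0.4)
alpha = np.eye(1, T)[0]          # attention for the (fictitious) 0-th output step
for _ in range(3):
    alpha = expected_alignment_step(p, alpha)
    print(alpha.round(3), alpha.sum().round(3))
```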
Problem definition: MA
• Problem definitions
• Although this MA achieves online linear time decoding, the decoder can only attend to one encoder state.
• This limitation can diminish translation quality as there may be insufficient information for reordering.
• Moreover, the backpropagation is not available in the hard monotonic attention.
Background: Monotonic Chunkwise Attention (MoChA)
• Hard monotonic alignment is too restrictive!
• Using only one encoder state ℎ_{𝑡_𝑖} as the context vector 𝑐_𝑖 is too strong a constraint.
• A novel solution to this problem is the Monotonic Chunkwise Attention mechanism (MoChA) [10].
• It allows the model to use soft attention over a fixed-size chunk (of size 𝑤) of memory ending at input time step
𝑡_𝑖 for each output time step 𝑖.
$$u_{i,k} = \mathrm{ChunkEnergy}(s_{i-1}, h_k) = v^{T} \tanh(W_s s_{i-1} + W_h h_k + b)$$
$$c_i = \sum_{k=t_i-w+1}^{t_i} \frac{\exp(u_{i,k})}{\sum_{l=t_i-w+1}^{t_i} \exp(u_{i,l})}\, h_k$$
[10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018.
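A minimal NumPy sketch of the chunkwise context computation above, assuming the stopping position t_i and the chunk energies u_{i,k} are already available; the chunk size w and all names are illustrative.

```python
import numpy as np

def mocha_context(h, u_i, t_i, w):
    """Soft attention over the length-w chunk of memory ending at position t_i."""
    lo = max(0, t_i - w + 1)                      # chunk start t_i - w + 1, clipped at the sequence start
    chunk_u = u_i[lo:t_i + 1]
    weights = np.exp(chunk_u - chunk_u.max())
    weights /= weights.sum()                      # softmax over the chunk only
    return weights @ h[lo:t_i + 1]                # c_i: weighted sum of the chunk

# Toy usage: memory of 10 states, chunk size 3, stop position t_i = 6.
rng = np.random.default_rng(0)
h = rng.standard_normal((10, 4))
u_i = rng.standard_normal(10)                     # ChunkEnergy(s_{i-1}, h_k) for each k
print(mocha_context(h, u_i, t_i=6, w=3).shape)    # -> (4,)
```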
Problem definition: Two major limitations of the hard monotonic mechanism.
• Problem definitions in conventional methods.
• Not enough context in the context vector.
• Assumption of strict monotonicity in input and output alignments.
• RNN-based networks.
Proposed architecture: Monotonic Multihead Attention (MMA)
• The Transformer [1] architecture has recently become the state-of-the-art for machine translation [14].
• An important feature of the Transformer is the use of a separate multihead attention module at each layer.
• Thus, this paper proposes a new approach, Monotonic Multihead Attention (MMA), which combines the expressive
power of multihead attention with the low latency of monotonic attention.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
[14] Barrault, Loïc, et al. "Findings of the 2019 conference on machine translation (wmt19)." Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
Related works: Transformer
• Given queries 𝑄, keys 𝐾 and values 𝑉, multihead attention Multihead(𝑄, 𝐾, 𝑉) is defined as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O}$$
$$\text{where } \mathrm{head}_h = \mathrm{Attention}(Q W_h^{Q},\, K W_h^{K},\, V W_h^{V})$$
• The attention function is the scaled dot-product attention, defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
• Multihead attention thus allows each decoder layer to have multiple heads, where each head can
compute a different attention distribution.
• No RNNs or CNNs in this network.
• Computation is parallel across positions, so it is fast.
• Better performance on large datasets.
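For reference, a compact NumPy sketch of scaled dot-product multihead attention as defined above; the head count, dimensions, and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V              # Softmax(QK^T / sqrt(d_k)) V

def multihead(Q, K, V, W_Q, W_K, W_V, W_O):
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O              # Concat(head_1..head_H) W^O

# Toy usage: 4 heads, model dim 16, head dim 4, 5 target and 7 source positions.
rng = np.random.default_rng(0)
H, d_model, d_k = 4, 16, 4
Q, K, V = rng.standard_normal((5, d_model)), rng.standard_normal((7, d_model)), rng.standard_normal((7, d_model))
W_Q, W_K, W_V = (rng.standard_normal((H, d_model, d_k)) for _ in range(3))
W_O = rng.standard_normal((H * d_k, d_model))
print(multihead(Q, K, V, W_Q, W_K, W_V, W_O).shape)          # -> (5, 16)
```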
Proposed architecture: MMA
• For a Transformer with 𝐿 decoder layers and 𝐻 attention heads per layer, the paper defines the selection process of the ℎ-th
encoder-decoder attention head in the 𝑙-th decoder layer as:
$$e_{i,j}^{l,h} = \frac{m_j W_{l,h}^{K} \left(s_{i-1} W_{l,h}^{Q}\right)^{T}}{\sqrt{d_k}}$$
$$p_{i,j}^{l,h} = \mathrm{Sigmoid}\!\left(e_{i,j}^{l,h}\right)$$
$$z_{i,j}^{l,h} \sim \mathrm{Bernoulli}\!\left(p_{i,j}^{l,h}\right)$$
where $W_{l,h}^{K}$ and $W_{l,h}^{Q}$ are the input projection matrices and $d_k$ is the dimension of each attention head.
Proposed architecture: MMA
• Independent stepwise selection probability for layer 𝑙 and head ℎ:
$$p_{i,j}^{l,h} \begin{cases} \geq 0.5 & \text{layer } l \text{, head } h \text{ stops reading} \\ < 0.5 & \text{layer } l \text{, head } h \text{ moves one step forward} \end{cases}$$
• Inference algorithm (see the sketch below):
• A source token is read if the fastest head decides to read.
• A target token is written only once all the heads have finished reading.
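A minimal sketch of this read/write policy: each head keeps its own pointer, a source token is read while some head still wants to move forward, and a target token is written once every head has stopped. The 0.5 threshold follows the slides; everything else (the names and the `p_fn` interface) is an illustrative assumption.

```python
import numpy as np

def mma_step(p_fn, pointers, num_read, src_finished):
    """Advance all monotonic heads for one target step.

    p_fn(head_id, j) -> stepwise selection probability p_{i,j} for that head.
    pointers[head_id] -> current source position of the head.
    Returns the action: 'read' if some head needs more source, else 'write'.
    """
    for head, j in enumerate(pointers):
        while j < num_read:
            if p_fn(head, j) >= 0.5:      # this head stops reading at position j
                break
            j += 1                        # this head moves one step forward
        pointers[head] = j
        if j == num_read and not src_finished:
            return "read"                 # the fastest head ran out of available source tokens
    return "write"                        # all heads have stopped -> emit a target token

# Toy usage with a fixed probability table (3 heads, 8 source tokens already read).
rng = np.random.default_rng(0)
table = rng.uniform(size=(3, 8))
action = mma_step(lambda h, j: table[h, j], pointers=[0, 0, 0], num_read=8, src_finished=True)
print(action)
```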
Proposed architecture: MMA
• MMA-H
• Hard alignment.
• Potential for streaming.
• MMA-IL
• Infinite lookback.
• Good translation quality.
MMA-H
• For MMA-H(ard), the paper uses Equation (1) above to calculate the expected alignment for each layer and each head, given p_{i,j}^{l,h}.
• Each attention head in MMA-H attends to exactly one encoder state.
• However, there are multiple heads in each layer (and multiple layers).
• Therefore, compared with previous MA-based models, the MMA model can place its attention at several different positions.
• Even the hard-alignment variant (MMA-H) is thus able to preserve history information by letting some heads stay on past states.
MMA-IL
• For MMA-IL, authors calculate the Softmax energy for each head as follows:
$$u_{i,j}^{l,h} = \mathrm{SoftEnergy}(i, j) = \frac{m_j W_{l,h}^{K} \left(s_{i-1} W_{l,h}^{Q}\right)^{T}}{\sqrt{d_k}}$$
• MMA-IL then allows the decoder to access encoder states from the beginning of the source sequence.
• Each attention head in MMA-IL can attend to all previous encoder states,
• so it takes more time than MMA-H,
• but its translation quality is better than MMA-H.
• MMA models use unidirectional encoders: the encoder self-attention can only attend to previous states, which is
also required for simultaneous translation.
MMA Algorithm
Compare MMA-H to MMA-IL
• MMA-H
• This model is faster than MMA-IL → suited to streaming.
• MMA-IL
• Better translation quality.
• Thus, MMA-IL allows the model to leverage more information for translation, but MMA-H may be better suited for
streaming systems with stricter efficiency requirements!
Experiments
• Datasets
• IWSLT 2015 English-Vietnamese.
• WMT 2015 German-English.
• Latency Metrics
• Average Proportion (AP) [4].
• Average Lagging (AL) [5].
• Differentiable Average Lagging (DAL) [3].
• Quality Metrics
• BLEU score.
[3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
[4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016).
[5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. 2019.
Model specification
Results: Latency-quality for MILk and MMA on IWSLT15 En-Vi and WMT15 De-EN.
• The BLEU and latency scores on the test set are generated by setting a latency range and selecting the checkpoint with best
BLEU score on the validation set.
• Black dashed line indicates the unidirectional offline transformer model with greedy search.
• While MMA-IL tends to have a decrease in quality as the latency decreases, MMA-H has a small gain in quality as latency
decreases: a larger latency does not necessarily mean an increase in source information available to the model.
• In fact, the large latency is from the outlier attention heads, which skip the entire source sentence and point to the end of the
sentence.
Results
• Note that latency increases with the number of attention heads.
• With 6 layers, the best performance is reached with 16 heads.
Conclusion
Summary
• This paper proposed two variants of the monotonic multihead attention model for simultaneous machine translation.
• Introduced two new targeted loss terms for latency control.
• Achieved better latency-quality trade-offs than the previous state-of-the-art model.
Reference
• [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
• [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International
Conference on Machine Learning. 2017.
• [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation."
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
• [4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint
arXiv:1606.02012 (2016).
• [5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using
Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
2019.
• [6] Dalvi, Fahim, et al. "Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine
Translation." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
• [7] Grissom II, Alvin, et al. "Don’t until the final verb wait: Reinforcement learning for simultaneous machine
translation." Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP). 2014.
• [8] Gu, Jiatao, et al. "Learning to translate in real-time with neural machine translation." arXiv preprint
arXiv:1610.00388 (2016).
Reference
• [9] Luo, Yuping, et al. "Learning online alignments with continuous rewards policy gradient." 2017 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
• [10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on
Learning Representations. 2018.
• [11] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks."
Advances in neural information processing systems. 2014.
• [12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to
align and translate." arXiv preprint arXiv:1409.0473 (2014).
• [13] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based
neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
• [14] Barrault, Loïc, et al. "Findings of the 2019 conference on machine translation (wmt19)." Proceedings of the
Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
Appendix: Conventional attention mechanism
• Bahdanau attention (additive attention)
$$c_t = \sum_{j=1}^{T_x} a_{tj} h_j = H a_t$$
$$a_t = \mathrm{Softmax}\!\left(\left[\mathrm{Score}(s_{t-1}, h_j)\right]_{j=1}^{T_x}\right) \in \mathbb{R}^{T_x}$$
$$\mathrm{Score}(s_{t-1}, h_j) = v^{T} \tanh(W_a s_{t-1} + U_a h_j)$$
where 𝑠_𝑡 is the decoder hidden state, 𝑐_𝑡 is the context vector,
and 𝑦_{𝑡−1} is the input at the current time step.
Appendix: Conventional attention mechanism
• Luong attention (dot-product attention)
$$c_t = \sum_{j=1}^{T_x} a_{tj} h_j = H a_t$$
$$a_t = \mathrm{Softmax}\!\left(\left[\mathrm{Score}(s_t, h_j)\right]_{j=1}^{T_x}\right) \in \mathbb{R}^{T_x}$$
$$\tilde{s}_t = \tanh(W_{ss} s_t + W_{cs} c_t + b_s)$$
$$y_t = \mathrm{Softmax}(W_y \tilde{s}_t + b_y)$$
The differences from Bahdanau attention are:
- It uses 𝑠_𝑡 instead of 𝑠_{𝑡−1}.
- It combines 𝑐_𝑡 with 𝑠_𝑡 through the attentional state $\tilde{s}_t$ before producing the output.
The computation path in Luong attention is simpler because the part that produces the
output and the part that performs the recursive
operation of the RNN can be separated.
Appendix: Soft monotonic attention decoder algorithm (training phase)
Appendix: Hard monotonic attention decoder algorithm (evaluation phase)
More information is in the paper.
• Expected delay for layer 𝑙, head ℎ:
$$g_i^{l,h} = \sum_{j=1}^{|x|} j\, \alpha_{i,j}^{l,h}$$
• Weighted average latency loss, where 𝐶(·) is the differentiable average-lagging metric:
$$\bar{g}_i^{W} = \sum_{l=1}^{L} \sum_{h=1}^{H} \frac{\exp\!\left(g_i^{l,h}\right)}{\sum_{l=1}^{L} \sum_{h=1}^{H} \exp\!\left(g_i^{l,h}\right)}\, g_i^{l,h}, \qquad L_{avg} = C\!\left(\bar{g}^{W}\right)$$
• Head divergence (variance) loss, where $\bar{g}_i$ is the average delay over all layers and heads at step 𝑖:
$$L_{var} = \frac{1}{|y|\, L H} \sum_{i=1}^{|y|} \sum_{l=1}^{L} \sum_{h=1}^{H} \left(g_i^{l,h} - \bar{g}_i\right)^2$$
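A small NumPy sketch of the two latency-control terms above, computed from the expected alignments of every layer and head; `C` is left as a pluggable callable (the paper uses differentiable average lagging), and the shapes and names are illustrative assumptions.

```python
import numpy as np

def latency_losses(alpha, C):
    """alpha: expected alignments, shape (L, H, target_len, source_len)."""
    L, H, tgt, src = alpha.shape
    j = np.arange(1, src + 1)
    g = (alpha * j).sum(-1)                              # expected delay g_i^{l,h}, shape (L, H, tgt)
    w = np.exp(g) / np.exp(g).sum(axis=(0, 1), keepdims=True)
    g_w = (w * g).sum(axis=(0, 1))                       # weighted average delay per target step
    loss_avg = C(g_w)                                    # e.g. differentiable average lagging
    loss_var = ((g - g.mean(axis=(0, 1))) ** 2).mean()   # divergence of heads from the mean delay
    return loss_avg, loss_var

# Toy usage: 2 layers, 4 heads, 5 target steps, 7 source steps; C = mean delay for illustration.
rng = np.random.default_rng(0)
alpha = rng.dirichlet(np.ones(7), size=(2, 4, 5))        # each alpha_{i,:} sums to 1
print(latency_losses(alpha, C=np.mean))
```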
Appendix: Latency and Attention Span Control
Appendix: Latency Metric Calculation
• Average Proportion
$$AP = \frac{1}{|x|\,|y|} \sum_{i=1}^{|y|} g_i$$
• Average Lagging
$$AL = \frac{1}{\tau} \sum_{i=1}^{\tau} \left( g_i - \frac{i-1}{|y| / |x|} \right), \qquad \tau = \operatorname{argmin}_i \left( g_i = |x| \right)$$
• Differentiable Average Lagging
$$DAL = \frac{1}{|y|} \sum_{i=1}^{|y|} \left( g_i' - \frac{i-1}{|y| / |x|} \right), \qquad g_i' = \begin{cases} g_i & i = 1 \\ \max\!\left(g_i,\; g_{i-1}' + \frac{|x|}{|y|}\right) & i > 1 \end{cases}$$
where 𝑥 is the source sentence, 𝑦 is the target sentence, and 𝑔_𝑖 is the delay (number of source tokens read) when the 𝑖-th target token is written.
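The three metrics above can be computed directly from the per-target-token delays g. Here is a small NumPy sketch under the standard definitions; treat it as an illustration rather than the official evaluation code.

```python
import numpy as np

def average_proportion(g, src_len):
    return g.sum() / (src_len * len(g))

def average_lagging(g, src_len):
    gamma = len(g) / src_len                          # |y| / |x|
    tau = int(np.argmax(g >= src_len)) + 1            # first step where the full source has been read
    i = np.arange(1, tau + 1)
    return np.mean(g[:tau] - (i - 1) / gamma)

def differentiable_average_lagging(g, src_len):
    gamma = len(g) / src_len
    g_prime = np.empty_like(g, dtype=float)
    g_prime[0] = g[0]
    for i in range(1, len(g)):
        g_prime[i] = max(g[i], g_prime[i - 1] + 1.0 / gamma)
    i = np.arange(1, len(g) + 1)
    return np.mean(g_prime - (i - 1) / gamma)

# Toy usage: a wait-3 policy on a 6-token source producing 6 target tokens.
g = np.array([3, 4, 5, 6, 6, 6], dtype=float)
print(average_proportion(g, 6), average_lagging(g, 6), differentiable_average_lagging(g, 6))
```

For this wait-3 example, AL and DAL both come out to 3, matching the intuition that the output lags the input by three tokens.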
Thank you!

More Related Content

What's hot

指数時間アルゴリズムの最先端
指数時間アルゴリズムの最先端指数時間アルゴリズムの最先端
指数時間アルゴリズムの最先端Yoichi Iwata
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
Seiya Tokui
 
相互情報量を用いた独立性の検定
相互情報量を用いた独立性の検定相互情報量を用いた独立性の検定
相互情報量を用いた独立性の検定
Joe Suzuki
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningSequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learning
Roberto Pereira Silveira
 
Confidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみたConfidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみたtkng
 
RNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential DataRNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential Data
Yao-Chieh Hu
 
Genetic Algorithms Made Easy
Genetic Algorithms Made EasyGenetic Algorithms Made Easy
Genetic Algorithms Made Easy
Prakash Pimpale
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practice
tuxette
 
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
Euijin Jeong
 
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
Masashi Shibata
 
Masked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision LearnersMasked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision Learners
GuoqingLiu9
 
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
Deep Learning JP
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
ananth
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
Amit Kumar Rathi
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
Pier Luca Lanzi
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLP
Anuj Gupta
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network
Masahiro Suzuki
 
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
Takahiro Kubo
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
Grigory Sapunov
 
Pegasus
PegasusPegasus
Pegasus
Hangil Kim
 

What's hot (20)

指数時間アルゴリズムの最先端
指数時間アルゴリズムの最先端指数時間アルゴリズムの最先端
指数時間アルゴリズムの最先端
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
相互情報量を用いた独立性の検定
相互情報量を用いた独立性の検定相互情報量を用いた独立性の検定
相互情報量を用いた独立性の検定
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningSequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learning
 
Confidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみたConfidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみた
 
RNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential DataRNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential Data
 
Genetic Algorithms Made Easy
Genetic Algorithms Made EasyGenetic Algorithms Made Easy
Genetic Algorithms Made Easy
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practice
 
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
강화학습 기초_2(Deep sarsa, Deep Q-learning, DQN)
 
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
CMA-ESサンプラーによるハイパーパラメータ最適化 at Optuna Meetup #1
 
Masked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision LearnersMasked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision Learners
 
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLP
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network
 
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
言葉のもつ広がりを、モデルの学習に活かそう -one-hot to distribution in language modeling-
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
 
Pegasus
PegasusPegasus
Pegasus
 

Similar to Monotonic Multihead Attention review

Parallel WaveGAN review
Parallel WaveGAN reviewParallel WaveGAN review
Parallel WaveGAN review
June-Woo Kim
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
Yuta Niki
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Universitat Politècnica de Catalunya
 
Conformer review
Conformer reviewConformer review
Conformer review
June-Woo Kim
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it!
Sudeep Das, Ph.D.
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
Jeong-Gwan Lee
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
Karthik Murugesan
 
Use CNN for Sequence Modeling
Use CNN for Sequence ModelingUse CNN for Sequence Modeling
Use CNN for Sequence Modeling
Dongang (Sean) Wang
 
Sequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural NetworksSequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural Networks
Nguyen Quang
 
Non autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech reviewNon autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech review
June-Woo Kim
 
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
Rashid Mijumbi
 
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Universitat Politècnica de Catalunya
 
IA3_presentation.pptx
IA3_presentation.pptxIA3_presentation.pptx
IA3_presentation.pptx
KtonNguyn2
 
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
AI Frontiers
 
Unifi
Unifi Unifi
Unifi
hangal
 
Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?
NAVER Engineering
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
changedaeoh
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Isabelle Augenstein
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx
TuCaoMinh2
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer Calixto
Sebastian Ruder
 

Similar to Monotonic Multihead Attention review (20)

Parallel WaveGAN review
Parallel WaveGAN reviewParallel WaveGAN review
Parallel WaveGAN review
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
 
Conformer review
Conformer reviewConformer review
Conformer review
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it!
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Use CNN for Sequence Modeling
Use CNN for Sequence ModelingUse CNN for Sequence Modeling
Use CNN for Sequence Modeling
 
Sequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural NetworksSequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural Networks
 
Non autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech reviewNon autoregressive neural text-to-speech review
Non autoregressive neural text-to-speech review
 
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
A Connectionist Approach to Dynamic Resource Management for Virtualised Netwo...
 
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
 
IA3_presentation.pptx
IA3_presentation.pptxIA3_presentation.pptx
IA3_presentation.pptx
 
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
 
Unifi
Unifi Unifi
Unifi
 
Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer Calixto
 

Recently uploaded

Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
Madan Karki
 
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
HODECEDSIET
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
Aditya Rajan Patra
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
IJNSA Journal
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
University of Maribor
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
abbyasa1014
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
Dr Ramhari Poudyal
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
Las Vegas Warehouse
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
co23btech11018
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
rpskprasana
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
enizeyimana36
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
gerogepatton
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Sinan KOZAK
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 

Recently uploaded (20)

Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
 
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 

Monotonic Multihead Attention review

  • 1. Monotonic Multihead Attention Presented by: June-Woo Kim ABR lab, Department of Artificial Intelligence, Kyungpook National University 2, Dec. 2020. Xutai Ma et al. Facebook, Johns Hopkins Univ. ICLR 2020
  • 2. Presentation contents • Abstract • Background • Problem definition • Proposed architecture • Experiments and results • Reference • Appendix
  • 3. Abstract • This paper extends previous models for monotonic attention to the multi-head attention used in Transformers [1], yielding “Monotonic Multi-head Attention”. • The proposed method is a relatively straightforward extension of the previous Hard [2] and Infinite Lookback [3] monotonic attention models. • This paper achieves better latency-quality tradeoffs in simultaneous Machine Translation tasks in two language pairs. • Also, this paper is a meaningful contribution to the task of simultaneously Machine Translation. [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017. [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
  • 4. Background: Simultaneous Translation • Start translating before reading the whole source sentence. • Applications • Video Subtitle Translation. • International Conferences. • Personal Translation Assistant. • Goal: We do not want to wait for the completion of the full sentence before we start translation!
  • 5. Background: Simultaneous Translation • Same grammar structure case  Simple! • English: • A republican strategy to counter the re-election of Obama. (SVO) • Spanish: • Una estrategia republicana para contrarrestar la reelección de Obama. (SVO) • S: subject, O: object, V: verb Action English Spanish Read A Republican strategy Write Una estrategia republicana Read to counter Write para contrarrestar Read the re-election of Obama Write la reelección de Obama
  • 6. Background: Simultaneous Translation • Different grammatical structure case  More complex • English: • A republican strategy to counter the re-election of Obama. (SVO) • Korean: • Obama 재선을 대응하기 위한 공화당 전략. (SOV) • S: subject, O: object, V: verb Action English German Read A Republican strategy Write 공화당 전략 Read to counter Read the re-election of Obama Write 대응하다 Write Obama 재선
  • 7. Background on the problem: Current Approaches 1. Fixed Policy – Weaker performance • Wait If-* [4] policy. (Rule-based) • Wait-k [5] policy. (Use 𝑘 tokens) • Incremental decoding [6]. (Incremental learning) 2. Reinforcement Learning – Less stable learning process. • Markov chain [7] (Markov chain with RL) • Make decisions on when to translate from the interaction [8]. (Used pre-trained offline model to teach agent) • Continuous rewards policy [9]. (Rewards policy gradient for online alignments) [4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016). [5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. [6] Dalvi, Fahim, et al. "Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018. [7] Grissom II, Alvin, et al. "Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation." Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP). 2014. [8] Gu, Jiatao, et al. "Learning to translate in real-time with neural machine translation." arXiv preprint arXiv:1610.00388 (2016). [9] Luo, Yuping, et al. "Learning online alignments with continuous rewards policy gradient." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
  • 8. Background on the problem: Current Approaches 3. Monotonic Attention (MA) – State of the art. • Hard Monotonic Attention [2]. (First introduction of concept of MA) • Monotonic Chunkwise Attention (MoChA) [10]. (Let the model attend to a chunk of encoder states) • Monotonic Infinite Lookback Attention (MILk) [3]. (Introduced infinite lookback to improve the quality) • Let’s focus and glance this mechanism! [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017. [10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018. [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
  • 9. Background: Sequence-to-sequence • Recently, the “Sequence-to-sequence” (Seq2Seq) [11] framework has facilitated the use of RNNs on sequence transduction problems such as machine translation and speech recognition. • Encoder: input sequence is processed with some networks. (e.g., RNNs, CNNs, FC-layers, hybrid, etc.) • Decoder: produce the target sequence with output of the encoder. (almost RNNs) • This often results in the model having difficulty to longer sequences than those seen during training. [11] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014. Figure reference: https://brunch.co.kr/@kakao-it/155
  • 10. Background: Attention mechanisms in Seq2seq • Attention mechanism in Seq2seq • Encoder: produces a sequence of hidden states which correspond to entries in the input sequence. • Decoder: allow to refer back to any of the encoder states as it produces its output. • In Seq2seq with attention, the encoder produces a sequence of hidden states which correspond to entries in the input sequence! • There are some effective attention mechanism with Seq2seq • Bahdanau attention [12]. (additive attention) • Luong attention [13]. (dot-product attention) • Multi-head attention [1]. (scaled dot-product attention with multi-head. Each heads do scaled dot-production, and those separated from number of heads) [12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014). [13] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015). [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
  • 11. Background: Soft attention • Encoder RNN do the given input sequence 𝑥 = {𝑥1, … , 𝑥 𝑇} to produce a sequence of hidden states ℎ = {ℎ1, … , ℎ 𝑇}. • Referring to ℎ as the “memory” to emphasize its connection to memory-augmented neural networks • Decoder RNN produces an output sequence 𝑦 = 𝑦1, … , 𝑦 𝑈 , conditioned on the memory.
  • 12. Background: Soft attention (additive attention) 𝑒𝑖,𝑗 = 𝑎(𝑠𝑖−1, ℎ𝑗) 𝛼𝑖,𝑗 = exp 𝑒𝑖,𝑗 𝑘=1 𝑇 exp 𝑒𝑖,𝑘 𝑐𝑖 = 𝑗=1 𝑇 𝛼𝑖,𝑗ℎ𝑗 𝑠𝑖 = 𝑓(𝑠𝑖−1, 𝑦𝑖−1, 𝑐𝑖) 𝑦𝑖 = 𝑔(𝑠𝑖, 𝑐𝑖) • When computing 𝑦𝑖, a soft attention-based decoder uses a learnable nonlinear function 𝑎(∙) to produce a scalar value 𝑒𝑖,𝑗 for each entry ℎ𝑗 in the memory based on ℎ𝑗 and the decoder’s state at the previous timestep 𝑠𝑖−1. • 𝑎(∙) is a single-layer neural network using a 𝑡𝑎𝑛ℎ nonlinearity, but other functions such as a simple dot product between 𝑠𝑖−1 and ℎ𝑗 have been used. • 𝑐𝑖 is the weighted sum of ℎ. • Decoder updates its state to 𝑠𝑖 based on 𝑠𝑖−1 and 𝑐𝑖 and produces 𝑦𝑖. • 𝑓(∙) is a RNN (one or more LSTM or GRU) and 𝑔(∙) is a learnable nonlinear function which maps the decoder state to the output space.
  • 13. Problem definitions in Attention mechanism in Seq2seq • Problem • A common criticism of soft attention is that the model must perform a pass over the entire input sequence when producing each element of the output sequence. • This results in the decoding process having complexity 𝑂(𝑇𝑈), where 𝑇 and 𝑈 are the input and output sequence lengths respectively. • Furthermore, because the entire sequence must be processed prior to outputting any symbols, soft attention cannot be used in “online” settings where output sequence elements are produced when the input has only been partially observed.
  • 14. Background: Monotonic Attention (MA) • Soft attention mechanisms perform a pass over the entire input sequence when producing each element in the output sequence. • Authors [2] proposed an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time. Conventional soft attention (e.g., additive, dot product) Monotonic attention [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
  • 15. Background: MA • If for any output timestep 𝑖 it have 𝑧𝑖,𝑗 = 0 for 𝑗 ∈ {𝑡𝑖 − 1, . . . , 𝑇}, it can simply set 𝑐𝑖 to a vector of zeros. • 𝑒𝑖,𝑗 = 𝑎(𝑠𝑖−1, ℎ𝑗)  𝑒𝑖,𝑗 = 𝑀𝑜𝑛𝑜𝑡𝑜𝑛𝑖𝑐𝐸𝑛𝑒𝑟𝑔𝑦(𝑠𝑖−1, ℎ𝑗) • 𝑝𝑖,𝑗 = 𝜎(𝑒𝑖,𝑗) • 𝑧𝑖,𝑗~Bernoulli(𝑝𝑖,𝑗) • where 𝑎(·) is a learnable deterministic “energy function” and 𝜎(·) is the logistic sigmoid function. • Note that the above model only needs ℎ 𝑘, 𝑘 ∈ {1, … , 𝑗} compute ℎ𝑗. • Time complexity will be 𝑂(max 𝑇, 𝑈 ) • Through this mechanism, we know that MA explicitly processes the input sequence in a left-to-right order and makes a hard assignment of 𝑐_𝑖 to one particular encoder state denoted ℎ_(𝑡_𝑖 ).
  • 16. Background: MA • This model cannot train using back-propagation because of sampling. • So that, this model use the expectation of ℎ𝑗 during training (inspired from soft alignments) and try to induce discreteness into 𝑝𝑖,𝑗. • The 𝛼𝑖,𝑗 defines the probability that input time step 𝑗 is attended at output time step 𝑖. • 𝛼𝑖,𝑗 = 𝑃𝑖 ℎ𝑗 𝑢𝑠𝑒𝑑 = 𝑃𝑖 ℎ𝑗 𝑢𝑠𝑒𝑑 ℎ𝑗 𝑐ℎ𝑒𝑐𝑘𝑒𝑑)𝑃𝑖(ℎ𝑗 𝑐ℎ𝑒𝑐𝑘𝑒𝑑) • 𝑃𝑖 ℎ𝑗 𝑢𝑠𝑒𝑑 ℎ𝑗 𝑐ℎ𝑒𝑐𝑘𝑒𝑑 = 𝑝𝑖,𝑗 • 𝑃𝑖 ℎ𝑗 𝑐ℎ𝑒𝑐𝑘𝑒𝑑 = 𝑃𝑖 ℎ𝑗−1 𝑛𝑜𝑡 𝑢𝑠𝑒𝑑, ℎ𝑗−1 𝑐ℎ𝑒𝑐𝑘𝑒𝑑 + 𝑃𝑖−1 ℎ𝑗 𝑢𝑠𝑒𝑑 𝑗 𝑐ℎ𝑒𝑐𝑘𝑒𝑑)
  • 17. Background: MA • Stepwise decision at output step i, source step j: if p_{i,j} ≥ 0.5, stop reading and write the i-th target token; if p_{i,j} < 0.5, read the (j + 1)-th source token. • Expected alignment (monotonic attention): α_{i,:} = p_{i,:} ⊙ cumprod(1 − p_{i,:}) ⊙ cumsum( α_{i−1,:} / cumprod(1 − p_{i,:}) )   (1) • where cumprod(x) = (1, x_1, x_1 x_2, …, Π_{i=1}^{|x|−1} x_i) is exclusive and cumsum(x) = (x_1, x_1 + x_2, …, Σ_{i=1}^{|x|} x_i) is inclusive. • The resulting process computes at most max(T, U) probabilities p_{i,j}, i.e. a linear runtime. • (A NumPy sketch of Equation (1) follows below.)
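The following is a minimal NumPy sketch of Equation (1), computing the expected alignment for one output step in parallel with an exclusive cumprod and an inclusive cumsum. The clipping constant eps and the one-hot initialisation of the first alignment are common implementation choices stated here as assumptions, not taken from the paper:

import numpy as np

def exclusive_cumprod(x):
    # cumprod(x) = (1, x_1, x_1 x_2, ..., prod_{i=1}^{|x|-1} x_i)
    return np.concatenate(([1.0], np.cumprod(x)[:-1]))

def expected_alignment(p_i, alpha_prev, eps=1e-10):
    """Equation (1): p_i and alpha_prev are length-T vectors p_{i,:} and alpha_{i-1,:}."""
    cp = exclusive_cumprod(1.0 - p_i)            # probability of not stopping before j
    return p_i * cp * np.cumsum(alpha_prev / np.clip(cp, eps, None))

# Toy check with alpha_{0,:} initialised as a one-hot on the first source position
T = 5
alpha = np.zeros(T); alpha[0] = 1.0
alpha = expected_alignment(np.full(T, 0.5), alpha)
print(alpha, alpha.sum())  # sums to <= 1; the missing mass is "never attends"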
  • 18. Problem definition: MA • Problem definitions • Although MA achieves online, linear-time decoding, the decoder can only attend to one encoder state per output token. • This limitation can diminish translation quality, as there may be insufficient information for reordering. • Moreover, back-propagation is not possible through the hard attention decisions, which is why the expected alignment is used during training.
  • 19. Background: Monotonic Chunkwise Attention (MoChA) • Hard monotonic alignment is too restrictive: using only the single vector h_{t_i} as the context vector c_i is too strong a constraint. • A solution to this problem is the Monotonic Chunkwise Attention (MoChA) mechanism [10]. • It allows the model to perform soft attention over a fixed-size chunk (of size w) of memory ending at input timestep t_i, for each output timestep i: • u_{i,k} = ChunkEnergy(s_{i−1}, h_k) = v^T tanh(W_s s_{i−1} + W_h h_k + b) • c_i = Σ_{k=t_i−w+1}^{t_i} [ exp(u_{i,k}) / Σ_{l=t_i−w+1}^{t_i} exp(u_{i,l}) ] h_k • (A sketch of the chunkwise context follows below.) [10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018.
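A minimal NumPy sketch of the MoChA context for one output step, assuming the hard-attended index t_i has already been chosen; chunk_energy is a hypothetical stand-in for ChunkEnergy(s_{i−1}, h_k):

import numpy as np

def mocha_context(s_prev, H, t_i, w, chunk_energy):
    """Soft attention over the chunk {t_i - w + 1, ..., t_i} of the memory H: (T, d)."""
    lo = max(0, t_i - w + 1)                      # clip the window at the sequence start
    chunk = H[lo:t_i + 1]                         # (<= w, d)
    u = np.array([chunk_energy(s_prev, h_k) for h_k in chunk])
    a = np.exp(u - u.max())
    a /= a.sum()                                  # softmax restricted to the chunk
    return a @ chunk                              # c_i: weighted sum of the chunk entries

# Illustrative usage with a dot-product stand-in energy:
# c_i = mocha_context(s_prev, H, t_i, w=4, chunk_energy=lambda s, h: float(s @ h))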
  • 20. Problem definition: Two major limitations of the hard monotonic mechanism • Problem definitions in the conventional methods: • Not enough context in the context vector. • The assumption of strictly monotonic input-output alignments. • In addition, previous monotonic attention models are built on RNN-based networks rather than the Transformer.
  • 21. Proposed architecture: Monotonic Multihead Attention (MMA) • The Transformer [1] architecture has recently become the state of the art for machine translation [14]. • An important feature of the Transformer is the use of a separate multihead attention module at each layer. • Thus, this paper proposes a new approach, Monotonic Multihead Attention (MMA), which combines the expressive power of multihead attention with the low latency of monotonic attention. [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. [14] Barrault, Loïc, et al. "Findings of the 2019 conference on machine translation (wmt19)." Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
  • 22. Related work: Transformer • Given queries Q, keys K and values V, multihead attention MultiHead(Q, K, V) is defined as: • MultiHead(Q, K, V) = Concat(head_1, …, head_H) W^O, where head_h = Attention(Q W_h^Q, K W_h^K, V W_h^V) • The attention function is the scaled dot-product attention, defined as: Attention(Q, K, V) = Softmax(Q K^T / √d_k) V • Multihead attention therefore allows each decoder layer to have multiple heads, where each head can compute a different attention distribution. • No RNNs or CNNs in this network. • Highly parallelizable, hence fast to train. • Better performance on large datasets. • (A sketch of multihead attention follows below.)
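As a concrete illustration (the weight layout and shapes are assumptions, not the exact Transformer implementation), multihead scaled dot-product attention can be sketched in NumPy as:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Q: (Tq, d_model); K, V: (Tk, d_model); Wq/Wk/Wv: (H, d_model, d_k); Wo: (H*d_k, d_model)."""
    num_heads, _, d_k = Wq.shape
    heads = []
    for h in range(num_heads):
        q, k, v = Q @ Wq[h], K @ Wk[h], V @ Wv[h]       # per-head projections
        attn = softmax(q @ k.T / np.sqrt(d_k))          # (Tq, Tk) scaled dot-product weights
        heads.append(attn @ v)                          # (Tq, d_k) per-head output
    return np.concatenate(heads, axis=-1) @ Wo          # Concat(head_1, ..., head_H) W^O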
  • 23. Proposed architecture: MMA • For a Transformer with L decoder layers and H attention heads per layer, the paper defines the selection process of the h-th encoder-decoder attention head in the l-th decoder layer as: • e_{i,j}^{l,h} = ( m_j W_{l,h}^K ) ( s_{i−1} W_{l,h}^Q )^T / √d_k • p_{i,j}^{l,h} = Sigmoid(e_{i,j}^{l,h}) • z_{i,j}^{l,h} ~ Bernoulli(p_{i,j}^{l,h}) • where W_{l,h}^K and W_{l,h}^Q are the input projection matrices, m_j is the j-th encoder state, and d_k is the dimension of each attention head.
  • 24. Proposed architecture: MMA • Independent stepwise selection probability for layer l and head h: • if p_{i,j}^{l,h} < 0.5, head h of layer l moves one step forward (keeps reading); if p_{i,j}^{l,h} ≥ 0.5, it stops reading. • Inference algorithm (see the sketch below): • A source token is read if the fastest head decides to read. • A target token is written only once all the heads have finished reading.
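The following is a hedged, simplified sketch of this read/write policy (not the paper's exact inference algorithm): each monotonic head scans the already-read source prefix independently; if any head reaches the end of the prefix without stopping, the model reads another source token, and a target token is written only once every head has stopped. head_prob is a hypothetical stand-in for p_{i,j}^{l,h}:

def mma_policy_step(num_layers, num_heads, i, head_start, num_read, head_prob):
    """Return per-head stop positions for target step i, or None if more source must be read.
    head_start[(l, h)]: index from which head (l, h) resumes scanning."""
    stops = {}
    for l in range(num_layers):
        for h in range(num_heads):
            j = head_start[(l, h)]
            while j < num_read and head_prob(l, h, i, j) < 0.5:
                j += 1                        # this head keeps reading within the prefix
            if j == num_read:                 # the fastest head wants more source
                return None                   # -> READ one more source token, then retry
            stops[(l, h)] = j                 # this head has finished reading
    return stops                              # all heads stopped -> WRITE target token i

After a write, the caller would update head_start from the returned stop positions before moving on to target step i + 1.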
  • 25. Proposed architecture: MMA • MMA-H • Hard alignment. • Potential for streaming. • MMA-IL • Infinite lookback. • Good translation quality.
  • 26. MMA-H • For MMA-H(ard), the paper uses Equation (1) (slide 17) to calculate the expected alignment for each layer and each head, given p_{i,j}^{l,h}. • Each attention head in MMA-H attends to exactly one encoder state. • However, there are multiple heads in each layer. • Therefore, compared with the previous single-head MA-based models, MMA can attend to several different positions at the same time. • Even the hard-alignment variant (MMA-H) is thus able to preserve history information by letting some heads point to past states.
  • 27. MMA-IL • For MMA-IL, the authors calculate a softmax energy for each head as follows: • u_{i,j}^{l,h} = SoftEnergy = ( m_j W_{l,h}^K ) ( s_{i−1} W_{l,h}^Q )^T / √d_k • and then allow the decoder to access encoder states from the beginning of the source sequence. • Each attention head in MMA-IL can attend to all previous encoder states (infinite lookback). • It is therefore slower than MMA-H, • but its translation quality is better. • MMA models use unidirectional encoders: the encoder self-attention can only attend to previous states, which is also required for simultaneous translation. • (A per-head sketch follows below.)
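As a hedged per-head sketch (the weight names and the use of the raw memory as values are assumptions), the MMA-IL context for one head can be written as: the head first halts at index t_i via the hard monotonic decision, then soft-attends over all encoder states up to t_i with the SoftEnergy above:

import numpy as np

def mma_il_head_context(s_prev, M, t_i, Wk, Wq):
    """s_prev: (d_model,) decoder state; M: (T, d_model) encoder memory; t_i: halt index."""
    d_k = Wk.shape[1]
    q = s_prev @ Wq                              # (d_k,) query projection of the decoder state
    K = M[:t_i + 1] @ Wk                         # (t_i + 1, d_k) keys over the read prefix only
    u = K @ q / np.sqrt(d_k)                     # soft energies u_{i,j} for j <= t_i
    a = np.exp(u - u.max())
    a /= a.sum()                                 # softmax with lookback to the sentence start
    return a @ M[:t_i + 1]                       # per-head context vector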
  • 29. Comparing MMA-H and MMA-IL • MMA-H • Faster than MMA-IL → better suited to streaming. • MMA-IL • Higher translation quality. • Thus, MMA-IL allows the model to leverage more information for translation, while MMA-H may be better suited for streaming systems with stricter efficiency requirements!
  • 30. Experiments • Datasets • IWSLT 2015 English-Vietnamese. • WMT 2015 German-English. • Latency Metrics • Average Proportion (AP) [4]. • Average Lagging (AL) [5]. • Differentiable Average Lagging (DAL) [3]. • Quality Metric • BLEU score. [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. [4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016). [5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
  • 32. Results: Latency-quality trade-offs for MILk and MMA on IWSLT15 En-Vi and WMT15 De-En • The BLEU and latency scores on the test set are generated by setting a latency range and selecting the checkpoint with the best BLEU score on the validation set. • The black dashed line indicates the unidirectional offline transformer model with greedy search. • While MMA-IL tends to lose quality as the latency decreases, MMA-H shows a small gain in quality as latency decreases: a larger latency does not necessarily mean that more source information is available to the model. • In fact, the large latency comes from outlier attention heads that skip the entire source sentence and point to its end.
  • 33. Results • Note that latency increases with the number of attention heads. • With 6 layers, the best performance is reached with 16 heads.
  • 34. Conclusion Summary • This paper proposed two variants of the monotonic multihead attention model for simultaneous machine translation. • Introduced two new targeted loss terms for latency control. • Achieved better latency-quality trade-offs than the previous state-of-the-art model.
  • 35. Reference • [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017. • [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017. • [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. • [4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016). • [5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. • [6] Dalvi, Fahim, et al. "Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018. • [7] Grissom II, Alvin, et al. "Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation." Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP). 2014. • [8] Gu, Jiatao, et al. "Learning to translate in real-time with neural machine translation." arXiv preprint arXiv:1610.00388 (2016).
  • 36. Reference • [9] Luo, Yuping, et al. "Learning online alignments with continuous rewards policy gradient." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017. • [10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018. • [11] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014. • [12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014). • [13] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015). • [14] Barrault, Loïc, et al. "Findings of the 2019 conference on machine translation (wmt19)." Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
  • 37. Appendix: Conventional attention mechanisms • Bahdanau attention (additive attention) • c_t = Σ_{j=1}^{T_x} a_{tj} h_j = H a_t • a_t = Softmax( [Score(s_{t−1}, h_j)]_{j=1}^{T_x} ) ∈ ℝ^{T_x} • Score(s_{t−1}, h_j) = v^T tanh(W_a s_{t−1} + U_a h_j) • where s_t is the decoder hidden state, c_t is the context vector, and y_{t−1} is the input at the current time step.
  • 38. Appendix: Conventional attention mechanisms • Luong attention (dot-product attention) • c_t = Σ_{j=1}^{T_x} a_{tj} h_j = H a_t • a_t = Softmax( [Score(s_t, h_j)]_{j=1}^{T_x} ) ∈ ℝ^{T_x} • s̃_t = tanh(W_ss s_t + W_cs c_t + b_s) • y_t = Softmax(W_y s̃_t + b_y) • Differences from Bahdanau attention: • it uses the current state s_t instead of s_{t−1}; • the context c_t is combined with s_t after the recurrence (to form s̃_t) rather than being fed into the RNN. • The computation path in Luong attention is simpler because the output computation and the recurrent RNN computation can be separated.
  • 39. Appendix: Soft monotonic attention decoder algorithm (training phase)
  • 40. Appendix: Hard monotonic attention decoder algorithm (evaluation phase)
  • 41. Appendix: Latency and Attention Span Control • More details are given in the paper. • Expected delay for layer l, head h: g_i^{l,h} = Σ_{j=1}^{|x|} j α_{i,j}^{l,h} • Weighted average delay: g_i^W = Σ_{l=1}^{L} Σ_{h=1}^{H} [ exp(g_i^{l,h}) / Σ_{l=1}^{L} Σ_{h=1}^{H} exp(g_i^{l,h}) ] g_i^{l,h}, giving the weighted average latency loss L_avg = C(g^W), where C(·) is the differentiable average lagging. • Head divergence loss: L_var = (1 / (|y| L H)) Σ_{i=1}^{|y|} Σ_{l=1}^{L} Σ_{h=1}^{H} ( g_i^{l,h} − ḡ_i )^2, where ḡ_i is the average delay over all heads at target step i. • (A NumPy sketch of these terms follows below.)
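A minimal NumPy sketch of these latency-control terms, computed from the expected alignments. The array layout is an assumption, and the latency cost C(·) applied to the weighted average delay is left abstract:

import numpy as np

def latency_control_terms(alpha):
    """alpha: (L, H, len_y, len_x) expected alignments alpha_{i,j}^{l,h}."""
    L, H, len_y, len_x = alpha.shape
    positions = np.arange(1, len_x + 1)
    g = (alpha * positions).sum(-1)                        # (L, H, len_y) expected delays g_i^{l,h}
    e = np.exp(g - g.max(axis=(0, 1), keepdims=True))      # softmax over heads, per target step
    w = e / e.sum(axis=(0, 1), keepdims=True)
    g_weighted = (w * g).sum(axis=(0, 1))                  # (len_y,) weighted average delay g_i^W
    g_bar = g.mean(axis=(0, 1), keepdims=True)             # average delay over heads, per step
    L_var = ((g - g_bar) ** 2).mean()                      # head divergence loss
    return g_weighted, L_var                               # L_avg would be C(g_weighted)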
  • 42. Appendix: Latency Metric Calculation • Average Proportion: AP = (1 / (|x| |y|)) Σ_{i=1}^{|y|} g_i • Average Lagging: AL = (1/τ) Σ_{i=1}^{τ} ( g_i − (i − 1) |x| / |y| ), where τ is the first index i such that g_i = |x| • Differentiable Average Lagging: DAL = (1/|y|) Σ_{i=1}^{|y|} ( g'_i − (i − 1) |x| / |y| ), where g'_i = g_i for i = 1 and g'_i = max( g_i, g'_{i−1} + |x| / |y| ) for i > 1 • where x is the source sentence, y the target sentence, and g_i the delay (the number of source tokens read when the i-th target token is written). • (A sketch of these metrics follows below.)
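The three metrics can be computed directly from the delay vector g; the sketch below assumes the full source is eventually read (so that g reaches |x|) and is an illustration rather than the official evaluation code:

import numpy as np

def average_proportion(g, src_len):
    return float(g.sum() / (src_len * len(g)))

def average_lagging(g, src_len):
    d = src_len / len(g)                              # ideal (wait-0) delay per target token
    tau = int(np.argmax(g >= src_len)) + 1            # first target step with the full source read
    i = np.arange(tau)
    return float((g[:tau] - i * d).mean())

def differentiable_average_lagging(g, src_len):
    d = src_len / len(g)
    g_prime = np.empty(len(g))
    g_prime[0] = g[0]
    for t in range(1, len(g)):
        g_prime[t] = max(g[t], g_prime[t - 1] + d)    # enforce a minimum per-token delay
    i = np.arange(len(g))
    return float((g_prime - i * d).mean())

# Toy usage: a wait-3 policy with an 8-token source and 8-token target
g = np.minimum(np.arange(1, 9) + 2, 8)
print(average_proportion(g, 8), average_lagging(g, 8), differentiable_average_lagging(g, 8))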

Editor's Notes

  1. Simultaneous Translation is very useful in many applications.
  2. Offline model: best performance with 3 layers and 2 heads (6 heads in total). MMA-H: improves with 1 layer and more heads. MMA-IL: behaves similarly to the offline model; best with 6 layers and 4 heads per layer (24 heads in total). For latency, the best performance is MMA-IL with 6 layers and 16 heads (96 heads in total).